apache/spark

Adjusted Score: 4.8
Raw Score: 4.8
Time Factor: 100%
Last Push: 2026-07-13
Stars: 43.6K
Language: Scala
Lines of Code: 3.0M
Files: 12.3K
Pattern Hits: 8.8K
Scan Date: 2026-07-14
HC Hit Rate: 0.06

What These Metrics Mean

Adjusted Score: Primary synthetic code indicator. Raw score normalised per 1,000 lines of code and multiplied by the temporal discount factor. This is the definitive comparative metric — use it to rank repositories by AI authorship density.
Raw Score: The unmodified sum of all severity-weighted, context-multiplied pattern match scores before temporal discounting. Reflects the absolute signal strength independent of when the repository was last active.
Time Factor: The temporal discount multiplier (0–100%) applied to the raw score. Repositories last updated before ChatGPT's launch (Nov 2022) receive a 5% factor. Full signal is only assigned to repositories active in the post-adoption era (Jan 2024+).
Pattern Hits: Total count of individual pattern matches across all files and categories. A high hit count with a low score may indicate a very large codebase with isolated AI snippets; a low count with a high score indicates dense, concentrated AI signatures.
HC Hit Rate: High+Critical pattern hits per file, averaged across the repository. This orthogonal signal catches repositories where a few files are densely packed with high-severity AI tells — a strong indicator even when the normalised score appears moderate due to codebase size.
Lines of Code / Files: Total lines and files analysed. The scanner examines 94 file extensions. These denominators are used to normalise the score, enabling fair comparison between repositories of vastly different sizes.
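
As a rough illustration of how these definitions fit together, here is a minimal Python sketch. It is not the scanner's actual implementation: the function names, the launch date of 30 Nov 2022, and the linear ramp between the two cutoffs are assumptions, since the descriptions above only state the 5% and 100% endpoints.

```python
from datetime import date

# Endpoints taken from the metric descriptions; the exact days are assumptions.
CHATGPT_LAUNCH = date(2022, 11, 30)
FULL_SIGNAL_START = date(2024, 1, 1)


def time_factor(last_push: date) -> float:
    """Temporal discount multiplier in the range 0.05 to 1.0."""
    if last_push < CHATGPT_LAUNCH:
        return 0.05
    if last_push >= FULL_SIGNAL_START:
        return 1.0
    # Assumption: linear ramp between the two stated endpoints.
    span = (FULL_SIGNAL_START - CHATGPT_LAUNCH).days
    elapsed = (last_push - CHATGPT_LAUNCH).days
    return 0.05 + 0.95 * elapsed / span


def adjusted_score(raw_points: float, lines_of_code: int, last_push: date) -> float:
    """Raw points per 1,000 lines of code, scaled by the time factor."""
    per_kloc = raw_points / (lines_of_code / 1000)
    return per_kloc * time_factor(last_push)


def hc_hit_rate(critical_hits: int, high_hits: int, files: int) -> float:
    """High + Critical pattern hits per analysed file."""
    return (critical_hits + high_hits) / files
```

With the figures on this page, `hc_hit_rate(434, 324, 12300)` comes to roughly 0.06, in line with the reported HC Hit Rate. The file count is rounded, so the match is approximate.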

Score History

This chart maps the temporal evolution of the adjusted synthetic code score across successive scan runs. An upward trajectory indicates ongoing incorporation of AI-generated code or expanding LLM-assisted scaffolding; a stable or declining trajectory may reflect active human refactoring, code removal, or the adoption of stricter authorship policies. The dashed secondary line (right axis) independently tracks total raw pattern hit count, which can diverge from the normalised score when codebase size changes significantly between scans.

Severity Breakdown

Classifies detected patterns by their diagnostic confidence and structural impact. CRITICAL patterns (coefficient 10) represent definitive synthetic signatures — hallucinated imports, explicit LLM attribution metadata — virtually never produced by human authors. HIGH (5) indicates strong structural tells such as cross-file repetition or cross-linguistic idioms. MEDIUM (2) covers recognisable conversational padding and AI-specific vocabulary. LOW (1) captures subtle indicators like tautological comments and generic boilerplate that require density to carry independent signal.

CRITICAL 434 · HIGH 324 · MEDIUM 320 · LOW 7692
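
The coefficients and counts above can be combined into a base weighted total. The following is a small sketch of that arithmetic only; the names are hypothetical, and it deliberately ignores the context and density multipliers, so it should not be read as the scanner's raw score.

```python
# Severity coefficients and hit counts as stated in the Severity Breakdown.
SEVERITY_WEIGHT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}
HIT_COUNTS = {"CRITICAL": 434, "HIGH": 324, "MEDIUM": 320, "LOW": 7692}


def base_weighted_points(counts, weights=SEVERITY_WEIGHT):
    """Sum of hits times severity coefficient, before any multipliers."""
    return sum(weights[severity] * n for severity, n in counts.items())


total_hits = sum(HIT_COUNTS.values())
base_points = base_weighted_points(HIT_COUNTS)
print(total_hits, base_points)
```

The four counts add up to the 8770 matches reported under Pattern Findings. Per-category point totals need not equal hits times coefficient (for example, 434 CRITICAL hits at coefficient 10 give 4340, while the category lists 4672 pts), presumably because of the multipliers applied per match.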

Directory Score Breakdown

This horizontal bar chart decomposes the repository's raw synthetic code score by top-level directory, allowing you to pinpoint precisely which modules or components carry the highest AI authorship density. Directories with disproportionately high scores relative to their size warrant targeted manual review: concentrated AI signatures often trace back to mass-generated configuration layers, auto-ported test suites, LLM-scaffolded boilerplate classes, or entire subsystems authored under heavy copilot assistance. Use this view to prioritise your human code-review effort.

Pattern Findings

The scanner identified 8770 distinct pattern matches across 23 syntactic categories. Each entry below represents a discrete location in the source code where the engine recorded a statistically significant AI authorship indicator. Expand any category row to inspect the individual file paths, line numbers, code snippets, and the lexical context (CODE, COMMENT, or STRING) in which each match was detected.

Reading the findings table: The Severity column indicates the diagnostic confidence level (CRITICAL / HIGH / MEDIUM / LOW). The Context column identifies whether the match occurred inside executable code, an inline comment, or a string literal — comment-context matches receive a ×1.5 weight because LLMs systematically over-annotate. The ⚡ bolt icon marks clustered matches: three or more patterns within a 10-line window, each receiving an additional ×1.5 density multiplier as dense clusters constitute far stronger evidence of synthetic authorship than isolated hits.
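
To make the weighting rules concrete, here is a hedged Python sketch of per-file scoring. The `Hit` type and function names are hypothetical. It also assumes that a "10-line window" means any span of 10 consecutive line numbers and that the comment and density multipliers stack multiplicatively, neither of which the description confirms.

```python
from dataclasses import dataclass

SEVERITY_WEIGHT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}
COMMENT_MULTIPLIER = 1.5   # comment-context matches
CLUSTER_MULTIPLIER = 1.5   # matches inside a dense cluster
CLUSTER_MIN_HITS = 3
CLUSTER_WINDOW = 10        # lines


@dataclass(frozen=True)
class Hit:
    line: int
    severity: str
    context: str  # "CODE", "COMMENT" or "STRING"


def clustered_indices(hits):
    """Indices of hits that share a 10-line window with at least two others."""
    order = sorted(range(len(hits)), key=lambda i: hits[i].line)
    flagged = set()
    start = 0
    for end in range(len(order)):
        # Shrink the window until it covers fewer than CLUSTER_WINDOW lines.
        while hits[order[end]].line - hits[order[start]].line >= CLUSTER_WINDOW:
            start += 1
        if end - start + 1 >= CLUSTER_MIN_HITS:
            flagged.update(order[start:end + 1])
    return flagged


def score_file(hits):
    """Severity-weighted score for one file's hits, with both multipliers."""
    flagged = clustered_indices(hits)
    total = 0.0
    for i, hit in enumerate(hits):
        points = SEVERITY_WEIGHT[hit.severity]
        if hit.context == "COMMENT":
            points *= COMMENT_MULTIPLIER
        if i in flagged:  # assumption: multipliers stack
            points *= CLUSTER_MULTIPLIER
        total += points
    return total
```

Under these assumptions, three hits on lines 10, 12 and 15 would all carry the bolt marker, while a fourth hit on line 100 would be scored as isolated.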

Hallucination Indicators · 434 hits · 4672 pts

Severity	File	Line	Snippet	Context
CRITICAL	…rg/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala	137	case e: Throwable if org.apache.commons.lang3.exception.ExceptionUtils.indexOfThrowable(	STRING
CRITICAL	…cheduler/cluster/YarnClientSchedulerBackendSuite.scala	55	Some(org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(0L, 1)))	CODE
CRITICAL⚡	…org/apache/spark/deploy/k8s/KubernetesUtilsSuite.scala	49	assert(sparkPod.pod.getSpec.getContainers.asScala.toList.map(_.getName) == List("first"))	CODE
CRITICAL⚡	…org/apache/spark/deploy/k8s/KubernetesUtilsSuite.scala	56	assert(sparkPod.pod.getSpec.getContainers.asScala.toList.map(_.getName) == List("second"))	CODE
CRITICAL⚡	…org/apache/spark/deploy/k8s/KubernetesUtilsSuite.scala	63	assert(sparkPod.pod.getSpec.getContainers.asScala.toList.map(_.getName) == List("second"))	CODE
CRITICAL	…k/scheduler/cluster/k8s/DeploymentAllocatorSuite.scala	134	assert(deployment.getSpec.getTemplate.getSpec.getContainers.asScala.exists(	CODE
CRITICAL	…heduler/cluster/k8s/ExecutorPVCResizePluginSuite.scala	187	captor.getValue.getSpec.getResources.getRequests.get("storage")).longValue()	CODE
CRITICAL	…k/scheduler/cluster/k8s/StatefulSetPodsAllocator.scala	171	val statefulSet = new io.fabric8.kubernetes.api.model.apps.StatefulSetBuilder()	CODE
CRITICAL	…rverExpectations/stage_with_summaries_expectation.json	5	"details" : "org.apache.spark.sql.Dataset.foreach(Dataset.scala:2862)\n$line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<	CODE
CRITICAL	…ectations/stage_with_accumulable_json_expectation.json	9	"details" : "org.apache.spark.rdd.RDD.foreach(RDD.scala:765)\n$line9.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)\n$	CODE
CRITICAL	…ions/stage_list_with_accumulable_json_expectation.json	9	"details" : "org.apache.spark.rdd.RDD.foreach(RDD.scala:765)\n$line9.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)\n$	CODE
CRITICAL	…oryServerExpectations/stage_list_json_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line19.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:17)\n$l	CODE
CRITICAL	…oryServerExpectations/stage_list_json_expectation.json	87	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line11.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)\n$l	CODE
CRITICAL	…oryServerExpectations/stage_list_json_expectation.json	93	"failureReason" : "Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost tas	CODE
CRITICAL	…oryServerExpectations/stage_list_json_expectation.json	170	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…oryServerExpectations/stage_list_json_expectation.json	252	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line9.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)\n$li	CODE
CRITICAL	…ctations/stage_list_with_peak_metrics_expectation.json	5	"details" : "org.apache.spark.sql.Dataset.foreach(Dataset.scala:2862)\n$line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<	CODE
CRITICAL	…erExpectations/one_stage_attempt_json_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…attempt_json_details_with_failed_task_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…toryServerExpectations/one_stage_json_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…Expectations/complete_stage_list_json_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line19.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:17)\n$l	CODE
CRITICAL	…Expectations/complete_stage_list_json_expectation.json	87	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…Expectations/complete_stage_list_json_expectation.json	169	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line9.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:15)\n$li	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	5	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	71	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	137	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	203	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	269	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	335	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	401	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	467	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	533	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…xpectations/stage_task_list_w__status_expectation.json	599	"errorMessage" : "java.lang.RuntimeException: bad exec\n\tat $line16.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.	CODE
CRITICAL	…pectations/excludeOnFailure_for_stage_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:370)\n$line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<consol	CODE
CRITICAL	…pectations/excludeOnFailure_for_stage_expectation.json	176	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…pectations/excludeOnFailure_for_stage_expectation.json	441	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…ations/stage_with_speculation_summary_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.collect(RDD.scala:1029)\n$line17.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<c	CODE
CRITICAL	…rExpectations/stage_with_peak_metrics_expectation.json	5	"details" : "org.apache.spark.sql.Dataset.foreach(Dataset.scala:2862)\n$line19.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<	CODE
CRITICAL	…ectations/one_stage_json_with_details_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:271)\n$line10.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)\n$lin	CODE
CRITICAL	…tions/one_stage_json_with_partitionId_expectation.json	5	"details" : "org.apache.spark.sql.Dataset.count(Dataset.scala:3130)\n$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<in	CODE
CRITICAL	…erExpectations/failed_stage_list_json_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.count(RDD.scala:910)\n$line11.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)\n$l	CODE
CRITICAL	…erExpectations/failed_stage_list_json_expectation.json	11	"failureReason" : "Job aborted due to stage failure: Task 3 in stage 2.0 failed 1 times, most recent failure: Lost tas	CODE
CRITICAL	…tions/excludeOnFailure_node_for_stage_expectation.json	5	"details" : "org.apache.spark.rdd.RDD.map(RDD.scala:370)\n$line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<consol	CODE
CRITICAL	…tions/excludeOnFailure_node_for_stage_expectation.json	371	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…tions/excludeOnFailure_node_for_stage_expectation.json	834	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…tions/excludeOnFailure_node_for_stage_expectation.json	901	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…tions/excludeOnFailure_node_for_stage_expectation.json	968	"errorMessage" : "java.lang.RuntimeException: Bad executor\n\tat $line15.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$an	CODE
CRITICAL	…/src/test/scala/org/apache/spark/ui/UIUtilsSuite.scala	220	val e1 = "Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 i	CODE
CRITICAL	…he/spark/shuffle/sort/ShuffleExternalSorterSuite.scala	109	// at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:384)	COMMENT
CRITICAL	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	721	val rootLogger = org.apache.logging.log4j.LogManager.getRootLogger()	CODE
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1754	// at org.apache.spark.util.UtilsSuite.throwException(UtilsSuite.scala:1529)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1759	// ----> at org.apache.spark.util.UtilsSuite.callGetTryFromNested(UtilsSuite.scala:1626) <---- STITCHED.	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1767	// at org.apache.spark.util.UtilsSuite.callDoTryNested(UtilsSuite.scala:1630)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1771	// at org.apache.spark.util.UtilsSuite.callDoTryNestedNested(UtilsSuite.scala:1654)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1804	// at org.apache.spark.util.UtilsSuite.throwException(UtilsSuite.scala:1529)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1808	// at org.apache.spark.util.UtilsSuite.callDoTry(UtilsSuite.scala:1534)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1813	// ----> at org.apache.spark.util.UtilsSuite.callGetTryFromNestedNested(UtilsSuite.scala:1650) <---- STITCHED.	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1821	// at org.apache.spark.util.UtilsSuite.callDoTryNestedNested(UtilsSuite.scala:1654)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1849	// at org.apache.spark.util.UtilsSuite.throwException(UtilsSuite.scala:1529)	STRING
CRITICAL⚡	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1853	// at org.apache.spark.util.UtilsSuite.callDoTry(UtilsSuite.scala:1534)	STRING
374 more matches not shown…

Hyper-Verbose Identifiers · 3027 hits · 2825 pts

Severity	File	Line	Snippet	Context
LOW	…la/org/apache/spark/deploy/yarn/YarnClusterSuite.scala	67	private def getOrCreatePyConnectDepChecker(	CODE
LOW	…apache/spark/network/shuffle/ShuffleTestAccessor.scala	136	def getOrCreateAppShufflePartitionInfo(	CODE
LOW⚡	…scala/org/apache/spark/deploy/yarn/YarnAllocator.scala	277	private def getOrUpdateAllocatedHostToContainersMapForRPId(	CODE
LOW⚡	…scala/org/apache/spark/deploy/yarn/YarnAllocator.scala	283	private def getOrUpdateRunningExecutorForRPId(rpId: Int): mutable.Set[String] = synchronized {	CODE
LOW⚡	…scala/org/apache/spark/deploy/yarn/YarnAllocator.scala	287	private def getOrUpdateNumExecutorsStartingForRPId(rpId: Int): AtomicInteger = synchronized {	CODE
LOW⚡	…scala/org/apache/spark/deploy/yarn/YarnAllocator.scala	291	private def getOrUpdateTargetNumExecutorsForRPId(rpId: Int): Int = synchronized {	CODE
LOW	…a/org/apache/spark/storage/DiskBlockManagerSuite.scala	144	private def getAndSetUmask(posix: POSIX, mask: String): String = {	CODE
LOW	…/resources/org/apache/spark/ui/static/executorspage.js	329	function reselectCheckboxesBasedOnTaskTableState() {	CODE
LOW	…resources/org/apache/spark/ui/static/streaming-page.js	62	function getMaxMarginLeftForTimeline() {	CODE
LOW	…resources/org/apache/spark/ui/static/streaming-page.js	69	function getOnClickTimelineFunction() {	CODE
LOW	…/resources/org/apache/spark/ui/static/timeline-view.js	171	function getStageIdAndAttemptForStageEntry(baseElem) {	CODE
LOW	…/resources/org/apache/spark/ui/static/timeline-view.js	239	function drawTaskAssignmentTimeline(groupArray, eventObjArray, minLaunchTime, maxFinishTime, offset) {	CODE
LOW	…main/resources/org/apache/spark/ui/static/stagepage.js	120	function getColumnNameForTaskMetricSummary(columnKey) {	CODE
LOW	…main/resources/org/apache/spark/ui/static/stagepage.js	175	function displayRowsForSummaryMetricsTable(row, type, columnIndex) {	CODE
LOW	…main/resources/org/apache/spark/ui/static/stagepage.js	218	function createDataTableForTaskSummaryMetricsTable(taskSummaryMetricsTable) {	CODE
LOW	…main/resources/org/apache/spark/ui/static/stagepage.js	277	function createRowMetadataForColumn(colKey, data, checkboxId) {	CODE
LOW	…main/resources/org/apache/spark/ui/static/stagepage.js	287	function reselectCheckboxesBasedOnTaskTableState() {	CODE
LOW	…esources/org/apache/spark/ui/static/environmentpage.js	47	function createRESTEndPointForEnvironmentPage(appId) {	CODE
LOW	…/resources/org/apache/spark/ui/static/spark-dag-viz.js	232	function getMaxChildWidthAndPaddingTop(g, v, svg) {	CODE
LOW	…src/main/resources/org/apache/spark/ui/static/table.js	52	function expandAllThreadStackTrace(toggleButton) {	CODE
LOW	…src/main/resources/org/apache/spark/ui/static/table.js	66	function collapseAllThreadStackTrace(toggleButton) {	CODE
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	1081	*/function injectEdgeLabelProxies(g){_.forEach(g.edges(),function(e){var edge=g.edge(e);if(edge.width&&edge.height){var	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	1325	*/function findSmallestWidthAlignment(g,xss){return _.minBy(_.values(xss),function(xs){var max=Number.NEGATIVE_INFINITY	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	360	function cartesianNormalizeInPlace(d){var l=sqrt(d[0]*d[0]+d[1]*d[1]+d[2]*d[2]);d[0]/=l,d[1]/=l,d[2]/=l}var lambda0$1,ph	CODE
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	439	}}}function clipAntimeridianIntersect(lambda0,phi0,lambda1,phi1){var cosPhi0,cosPhi1,sinLambda0Lambda1=sin(lambda0-lambd	CODE
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	891	percentRe=/^%/,requoteRe=/[\\^$*+?|[\]().{}]/g;function pad(value,fill,width){var sign=value<0?"-":"",string=(sign?-valu	CODE
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	1298	scanPos=0,prevLayerLength=prevLayer.length,lastNode=_.last(layer);_.forEach(layer,function(v,i){var w=findOtherInnerSegm	CODE
LOW	…src/main/resources/org/apache/spark/ui/static/utils.js	188	function createRESTEndPointForExecutorsPage(appId) {	CODE
LOW	…src/main/resources/org/apache/spark/ui/static/utils.js	211	function createRESTEndPointForMiscellaneousProcess(appId) {	CODE
LOW	core/src/main/scala/org/apache/spark/SparkContext.scala	3070	def getOrCreate(config: SparkConf): SparkContext = {	STRING
LOW	core/src/main/scala/org/apache/spark/SparkContext.scala	3094	def getOrCreate(): SparkContext = {	STRING
LOW	core/src/main/scala/org/apache/spark/util/Utils.scala	750	private[spark] def getOrCreateLocalRootDirs(conf: ReadOnlySparkConf): Array[String] = {	CODE
LOW	core/src/main/scala/org/apache/spark/util/Utils.scala	785	private def getOrCreateLocalRootDirsImpl(conf: ReadOnlySparkConf): Array[String] = {	CODE
LOW	…cala/org/apache/spark/util/UninterruptibleThread.scala	61	def getAndSetUninterruptible(value: Boolean): Boolean = synchronized {	CODE
LOW	…c/main/scala/org/apache/spark/util/AccumulatorV2.scala	486	private def getOrCreate = {	CODE
LOW	…a/org/apache/spark/deploy/master/ApplicationInfo.scala	82	private[deploy] def getOrUpdateExecutorsForRPId(rpId: Int): mutable.Set[Int] = {	CODE
LOW	…in/scala/org/apache/spark/scheduler/DAGScheduler.scala	602	private def getOrCreateShuffleMapStage(	CODE
LOW	…in/scala/org/apache/spark/scheduler/DAGScheduler.scala	802	private def getOrCreateParentStages(shuffleDeps: HashSet[ShuffleDependency[_, _, _]],	CODE
LOW	…/scala/org/apache/spark/status/AppStatusListener.scala	1145	private def getOrCreateExecutor(executorId: String, addTime: Long): LiveExecutor = {	CODE
LOW	…/scala/org/apache/spark/status/AppStatusListener.scala	1152	private def getOrCreateOtherProcess(processId: String,	CODE
LOW	…/scala/org/apache/spark/status/AppStatusListener.scala	1212	private def getOrCreateStage(info: StageInfo): LiveStage = {	CODE
LOW	core/src/main/scala/org/apache/spark/rdd/RDD.scala	369	private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =	CODE
LOW	core/src/main/scala/org/apache/spark/rdd/RDD.scala	381	private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {	CODE
LOW	…main/scala/org/apache/spark/storage/BlockManager.scala	1409	def getOrElseUpdateRDDBlock[T](	CODE
LOW	…main/scala/org/apache/spark/storage/BlockManager.scala	1432	private def getOrElseUpdate[T](	CODE
LOW	…g/apache/spark/api/python/PythonWorkerLogCapture.scala	97	private def getOrCreateLogWriter(workerId: String): (RollingLogWriter, AtomicLong) = {	CODE
LOW	…in/scala/org/apache/spark/resource/ResourceUtils.scala	323	def getOrDiscoverAllResources(	CODE
LOW	…in/scala/org/apache/spark/resource/ResourceUtils.scala	356	def getOrDiscoverAllResourcesForResourceProfile(	CODE
LOW	…/scala/org/apache/spark/resource/ResourceProfile.scala	376	private[spark] def getOrCreateDefaultProfile(conf: SparkConf): ResourceProfile = {	CODE
LOW	python/run-tests.py	236	def run_individual_python_test(target_dir, test_name, pyspark_python, keep_test_output):	CODE
LOW	python/run-tests.py	398	def get_default_python_executables():	CODE
LOW	python/pyspark/worker.py	140	def use_legacy_pandas_udf_conversion(self) -> bool:	CODE
LOW	python/pyspark/worker.py	147	def use_legacy_pandas_udtf_conversion(self) -> bool:	CODE
LOW	python/pyspark/worker.py	162	def int_to_decimal_coercion_enabled(self) -> bool:	CODE
LOW	python/pyspark/worker.py	180	def arrow_max_records_per_batch(self) -> int:	CODE
LOW	python/pyspark/worker.py	184	def arrow_max_bytes_per_batch(self) -> int:	CODE
LOW	python/pyspark/worker.py	344	def verify_iterator_exhausted(iterator: Iterator, error_class: str) -> None:	CODE
LOW	python/pyspark/worker.py	504	def wrap_grouped_transform_with_state_pandas_init_state_udf(f, return_type, runner_conf):	CODE
LOW	python/pyspark/worker.py	524	def wrap_grouped_transform_with_state_udf(f, return_type, runner_conf):	CODE
LOW	python/pyspark/worker.py	536	def wrap_grouped_transform_with_state_init_state_udf(f, return_type, runner_conf):	CODE
2967 more matches not shown…

Over-Commented Block · 2918 hits · 2735 pts

Severity	File	Line	Snippet	Context
LOW	.asf.yaml	1	# Licensed to the Apache Software Foundation (ASF) under one or more	COMMENT
LOW	.pre-commit-config.yaml	1	#	COMMENT
LOW	pyproject.toml	1	#	COMMENT
LOW	…rg/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala	101	yarnConf.setInt("yarn.scheduler.capacity.root.default.maximum-capacity", 100)	COMMENT
LOW	…rg/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala	141	}	COMMENT
LOW	…scala/org/apache/spark/deploy/yarn/YarnAllocator.scala	321	ResourceProfile.getResourcesForClusterManager(rp.id, rp.executorResources,	COMMENT
LOW	…c/main/scala/org/apache/spark/deploy/yarn/Client.scala	801	// Update the configuration with all the distributed files, minus the conf archive. The	COMMENT
LOW	…/kubernetes/docker/src/main/dockerfiles/spark/decom.sh	1	#!/usr/bin/env bash	COMMENT
LOW	…rnetes/docker/src/main/dockerfiles/spark/entrypoint.sh	1	#!/usr/bin/env bash	COMMENT
LOW	…s/core/src/test/resources/driver-podgroup-template.yml	1	#	COMMENT
LOW	…er/cluster/k8s/KubernetesClusterSchedulerBackend.scala	301	running.delete()	COMMENT
LOW	…/kubernetes/integration-tests/tests/pyfiles_connect.py	1	#	COMMENT
LOW	…/kubernetes/integration-tests/tests/decommissioning.py	1	#	COMMENT
LOW	…managers/kubernetes/integration-tests/tests/pyfiles.py	1	#	COMMENT
LOW	…ernetes/integration-tests/tests/worker_memory_check.py	1	#	COMMENT
LOW	…ernetes/integration-tests/tests/py_container_checks.py	1	#	COMMENT
LOW	…tes/integration-tests/tests/decommissioning_cleanup.py	1	#	COMMENT
LOW	…tes/integration-tests/tests/python_executable_check.py	1	#	COMMENT
LOW	…nagers/kubernetes/integration-tests/tests/autoscale.py	1	#	COMMENT
LOW	…ntegration-tests/scripts/setup-integration-test-env.sh	1	#!/usr/bin/env bash	COMMENT
LOW	…agers/kubernetes/integration-tests/dev/spark-rbac.yaml	1	#	COMMENT
LOW	…tes/integration-tests/dev/dev-run-integration-tests.sh	1	#!/usr/bin/env bash	COMMENT
LOW	…tegration-tests/src/test/resources/driver-template.yml	1	#	COMMENT
LOW	…gration-tests/src/test/resources/executor-template.yml	1	#	COMMENT
LOW	…-tests/src/test/resources/driver-schedule-template.yml	1	#	COMMENT
LOW	…st/resources/volcano/high-priority-driver-template.yml	1	#	COMMENT
LOW	…rces/volcano/low-priority-driver-podgroup-template.yml	1	#	COMMENT
LOW	…sources/volcano/driver-podgroup-template-memory-3g.yml	1	#	COMMENT
LOW	…/resources/volcano/queue0-driver-podgroup-template.yml	1	#	COMMENT
LOW	…n-tests/src/test/resources/volcano/priorityClasses.yml	1	#	COMMENT
LOW	…/resources/volcano/queue1-driver-podgroup-template.yml	1	#	COMMENT
LOW	…/resources/volcano/medium-priority-driver-template.yml	1	#	COMMENT
LOW	…t/resources/volcano/queue-driver-podgroup-template.yml	1	#	COMMENT
LOW	…s/volcano/medium-priority-driver-podgroup-template.yml	1	#	COMMENT
LOW	…est/resources/volcano/low-priority-driver-template.yml	1	#	COMMENT
LOW	…ces/volcano/high-priority-driver-podgroup-template.yml	1	#	COMMENT
LOW	…/org/apache/spark/launcher/AbstractCommandBuilder.java	201	if (isBeeLine && "1".equals(getenv("SPARK_CONNECT_BEELINE")) &&	COMMENT
LOW	…/org/apache/spark/launcher/AbstractCommandBuilder.java	321	return scala;	COMMENT
LOW	…he/spark/shuffle/sort/ShuffleExternalSorterSuite.scala	101	// same memory page. When a task reads memory written by another task, many types of failures	COMMENT
LOW	…t/scala/org/apache/spark/util/SizeEstimatorSuite.scala	301	// objectSize=8, fields=12 => shellSize=20, aligned to 24	COMMENT
LOW	…t/scala/org/apache/spark/util/SizeEstimatorSuite.scala	361	// DummyString has: pointer(arr,8) + Int(hashCode,4) + Int(hash32,4) = 16 bytes of fields	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1661	callGetTry(t)	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1681	assert(st1.exists(_.getMethodName == "callGetTry"))	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1701		COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1741	Utils.doTryWithCallerStacktrace {	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1761	// at org.scalatest.Assertions.intercept(Assertions.scala:749)	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1801	//	COMMENT
LOW	…/src/test/scala/org/apache/spark/util/UtilsSuite.scala	1841		COMMENT
LOW	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1041	//	COMMENT
LOW	…la/org/apache/spark/scheduler/HealthTrackerSuite.scala	361	// This ensures that we don't trigger spurious excluding for long tasksets, when the taskset	COMMENT
LOW	…rg/apache/spark/scheduler/TaskSchedulerImplSuite.scala	1161	// We should be checking our node excludelist, but it should be within the bound we defined	COMMENT
LOW	…rg/apache/spark/scheduler/TaskSchedulerImplSuite.scala	2541		COMMENT
LOW	…/test/scala/org/apache/spark/scheduler/PoolSuite.scala	121	scheduleTaskAndVerifyId(0, rootPool, 0)	COMMENT
LOW	…st/scala/org/apache/spark/executor/ExecutorSuite.scala	101	}	COMMENT
LOW	…sources/org/apache/spark/ui/static/graphlib-dot.min.js	141	// Label for the graph itself	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	201	h=s?Math.atan2(k,bl)*rad2deg-120:NaN;return new Cubehelix(h<0?h+360:h,s,l,o.opacity)}function cubehelix(h,s,l,opacity){r	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	281	// Limit forces for very close nodes; randomize direction if coincident.	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	301	function formatTrim(s){out:for(var n=s.length,i=1,i0=-1,i1;i<n;++i){switch(s[i]){case".":i0=i1=i;break;case"0":if(i0===0	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	321	// Perform the initial formatting.	COMMENT
LOW	…n/resources/org/apache/spark/ui/static/dagre-d3.min.js	341	(function(global,factory){typeof exports==="object"&&typeof module!=="undefined"?factory(exports,require("d3-array")):ty	COMMENT
2858 more matches not shown…

Cross-File Repetition · 182 hits · 910 pts

Severity	File	Line	Snippet	Context
HIGH	…c/main/scala/org/apache/spark/deploy/SparkSubmit.scala	0	welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version %s /_/	STRING
HIGH	…/src/main/scala/org/apache/spark/repl/SparkILoop.scala	0	welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version %s /_/	STRING
HIGH	python/pyspark/shell.py	0	welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version %s /_/	STRING
HIGH	python/pyspark/core/rdd.py	0	return an javardd of object by unpickling it will convert each python object into java object by pickle, whenever the rd	STRING
HIGH	python/pyspark/mllib/common.py	0	return an javardd of object by unpickling it will convert each python object into java object by pickle, whenever the rd	STRING
HIGH	python/pyspark/ml/common.py	0	return an javardd of object by unpickling it will convert each python object into java object by pickle, whenever the rd	STRING
HIGH	python/pyspark/tests/test_rdd.py	0	executes a job with the group ``job_group``. each job waits for 3 seconds and then exits.	STRING
HIGH	python/pyspark/tests/test_pin_thread.py	0	executes a job with the group ``job_group``. each job waits for 3 seconds and then exits.	STRING
HIGH	python/pyspark/sql/tests/test_job_cancellation.py	0	executes a job with the group ``job_group``. each job waits for 3 seconds and then exits.	STRING
HIGH	python/pyspark/ml/tree.py	0	trees in this ensemble. warning: these have null parent estimators.	STRING
HIGH	python/pyspark/ml/regression.py	0	trees in this ensemble. warning: these have null parent estimators.	STRING
HIGH	python/pyspark/ml/classification.py	0	trees in this ensemble. warning: these have null parent estimators.	STRING
HIGH	python/pyspark/ml/wrapper.py	0	returns the number of features the model was trained on. if unknown, returns -1	STRING
HIGH	python/pyspark/ml/regression.py	0	returns the number of features the model was trained on. if unknown, returns -1	STRING
HIGH	python/pyspark/ml/base.py	0	returns the number of features the model was trained on. if unknown, returns -1	STRING
HIGH	python/pyspark/ml/connect/base.py	0	returns the number of features the model was trained on. if unknown, returns -1	STRING
HIGH	python/pyspark/ml/classification.py	0	creates a copy of this instance with a randomly generated uid and some extra params. this copies creates a deep copy of	STRING
HIGH	python/pyspark/ml/tuning.py	0	creates a copy of this instance with a randomly generated uid and some extra params. this copies creates a deep copy of	STRING
HIGH	python/pyspark/ml/connect/tuning.py	0	creates a copy of this instance with a randomly generated uid and some extra params. this copies creates a deep copy of	STRING
HIGH	python/pyspark/ml/classification.py	0	given a java trainvalidationsplitmodel, create and return a python wrapper of it. used for ml persistence.	STRING
HIGH	python/pyspark/ml/pipeline.py	0	given a java trainvalidationsplitmodel, create and return a python wrapper of it. used for ml persistence.	STRING
HIGH	python/pyspark/ml/tuning.py	0	given a java trainvalidationsplitmodel, create and return a python wrapper of it. used for ml persistence.	STRING
HIGH	python/pyspark/ml/classification.py	0	transfer this instance to a java trainvalidationsplitmodel. used for ml persistence. returns ------- py4j.java_gateway.j	STRING
HIGH	python/pyspark/ml/pipeline.py	0	transfer this instance to a java trainvalidationsplitmodel. used for ml persistence. returns ------- py4j.java_gateway.j	STRING
HIGH	python/pyspark/ml/tuning.py	0	transfer this instance to a java trainvalidationsplitmodel. used for ml persistence. returns ------- py4j.java_gateway.j	STRING
HIGH	…thon/pyspark/ml/tests/connect/test_connect_function.py	0	these test cases exercise the interface to the proto plan generation but do not call spark.	STRING
HIGH	…hon/pyspark/sql/tests/connect/test_connect_function.py	0	these test cases exercise the interface to the proto plan generation but do not call spark.	STRING
HIGH	python/pyspark/sql/tests/connect/test_connect_plan.py	0	these test cases exercise the interface to the proto plan generation but do not call spark.	STRING
HIGH	python/pyspark/pandas/series.py	0	same as `to_pandas()`, without issuing the advice log for internal usage.	STRING
HIGH	python/pyspark/pandas/frame.py	0	same as `to_pandas()`, without issuing the advice log for internal usage.	STRING
HIGH	python/pyspark/pandas/indexes/multi.py	0	same as `to_pandas()`, without issuing the advice log for internal usage.	STRING
HIGH	python/pyspark/pandas/indexes/base.py	0	same as `to_pandas()`, without issuing the advice log for internal usage.	STRING
HIGH	python/pyspark/pandas/series.py	0	return a pandas index directly from _internal to avoid overhead of copy. this method is for internal use only.	STRING
HIGH	python/pyspark/pandas/frame.py	0	return a pandas index directly from _internal to avoid overhead of copy. this method is for internal use only.	STRING
HIGH	python/pyspark/pandas/indexes/base.py	0	return a pandas index directly from _internal to avoid overhead of copy. this method is for internal use only.	STRING
HIGH	python/pyspark/pandas/sql_formatter.py	0	a standard ``string.formatter`` in python that can understand pyspark instances with basic python objects. this object h	STRING
HIGH	python/pyspark/sql/sql_formatter.py	0	a standard ``string.formatter`` in python that can understand pyspark instances with basic python objects. this object h	STRING
HIGH	python/pyspark/sql/connect/sql_formatter.py	0	a standard ``string.formatter`` in python that can understand pyspark instances with basic python objects. this object h	STRING
HIGH	…pyspark/pandas/tests/data_type_ops/test_num_reverse.py	0	unit tests for arithmetic operations of numeric data types. a few test cases are disabled because pandas-on-spark return	STRING
HIGH	…hon/pyspark/pandas/tests/data_type_ops/test_num_ops.py	0	unit tests for arithmetic operations of numeric data types. a few test cases are disabled because pandas-on-spark return	STRING
HIGH	…park/pandas/tests/data_type_ops/test_num_arithmetic.py	0	unit tests for arithmetic operations of numeric data types. a few test cases are disabled because pandas-on-spark return	STRING
HIGH	python/pyspark/sql/catalog.py	0	creates an external table based on the dataset in a data source. it returns the dataframe associated with the external t	STRING
HIGH	python/pyspark/sql/context.py	0	creates an external table based on the dataset in a data source. it returns the dataframe associated with the external t	STRING
HIGH	python/pyspark/sql/connect/context.py	0	creates an external table based on the dataset in a data source. it returns the dataframe associated with the external t	STRING
HIGH	python/pyspark/sql/dataframe.py	0	returns the names of columns in this :class:`dataframe`. examples -------- >>> df = spark.createdataframe([(2, "alice"),	STRING
HIGH	python/pyspark/sql/classic/dataframe.py	0	returns the names of columns in this :class:`dataframe`. examples -------- >>> df = spark.createdataframe([(2, "alice"),	STRING
HIGH	python/pyspark/sql/connect/dataframe.py	0	returns the names of columns in this :class:`dataframe`. examples -------- >>> df = spark.createdataframe([(2, "alice"),	STRING
HIGH	python/pyspark/sql/udtf.py	0	user-defined function related classes and functions	STRING
HIGH	python/pyspark/sql/connect/udtf.py	0	user-defined function related classes and functions	STRING
HIGH	python/pyspark/sql/udf.py	0	user-defined function related classes and functions	STRING
HIGH	python/pyspark/sql/connect/udf.py	0	user-defined function related classes and functions	STRING
HIGH	python/pyspark/sql/tests/test_tvf.py	0	select * from variant_explode(parse_json('["hello", "world"]'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode(parse_json('["hello", "world"]'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode(parse_json('["hello", "world"]'))	STRING
HIGH	python/pyspark/sql/tests/test_tvf.py	0	select * from variant_explode(parse_json('{"a": true, "b": 3.14}'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode(parse_json('{"a": true, "b": 3.14}'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode(parse_json('{"a": true, "b": 3.14}'))	STRING
HIGH	python/pyspark/sql/tests/test_tvf.py	0	select * from variant_explode_outer(parse_json('["hello", "world"]'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode_outer(parse_json('["hello", "world"]'))	STRING
HIGH	…che/spark/sql/DataFrameTableValuedFunctionsSuite.scala	0	select * from variant_explode_outer(parse_json('["hello", "world"]'))	STRING
122 more matches not shown…

Cross-Language Confusion · 138 hits · 662 pts

Severity	File	Line	Snippet	Context
HIGH	python/pyspark/core/rdd.py	242	return self._jrdd.toString()	CODE
HIGH	python/pyspark/mllib/tree.py	90	return self._java_model.toString()	CODE
HIGH	python/pyspark/mllib/tree.py	150	return self._java_model.toString()	CODE
HIGH	python/pyspark/mllib/stat/test.py	64	return self._java_model.toString()	CODE
HIGH⚡	python/pyspark/tests/test_util.py	73	# This attempts java.lang.String(null) which throws an NPE.	COMMENT
HIGH	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	80	decimal, date, timestamp, duration, time, null, and nested types.	STRING
HIGH	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	31	- Success: [0, 1, null]@int16 - element values via scalar.as_py() and Arrow type after cast	STRING
HIGH	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	111	as "[val1, val2, null]@arrow_type" using each scalar's as_py() value.	STRING
HIGH⚡	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	126	On success: "[val1, val2, null]@arrow_type"	STRING
HIGH⚡	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	127	e.g. "[0, 1, -1, 127, -128, null]@int16"	STRING
HIGH	python/pyspark/testing/utils.py	371	script = "$(test $(tput colors)) && $(test $(tput colors) -ge 8) && echo true || echo false"	CODE
HIGH⚡	python/pyspark/ml/tests/test_wrapper.py	54	self.assertIn("LinearRegression_", model._java_obj.toString())	CODE
HIGH⚡	python/pyspark/ml/tests/test_wrapper.py	55	self.assertIn("LinearRegressionTrainingSummary", summary._java_obj.toString())	CODE
HIGH⚡	python/pyspark/ml/tests/test_wrapper.py	61	model._java_obj.toString()	CODE
HIGH⚡	python/pyspark/ml/tests/test_wrapper.py	62	self.assertIn("LinearRegressionTrainingSummary", summary._java_obj.toString())	CODE
HIGH	python/pyspark/ml/tests/test_wrapper.py	74	model._java_obj.toString()	CODE
HIGH	python/pyspark/ml/tests/test_wrapper.py	76	summary._java_obj.toString()	CODE
HIGH	python/pyspark/ml/tests/test_functions.py	253	self.assertTrue(df1.equals(df2))	STRING
HIGH	python/pyspark/ml/tests/test_functions.py	259	self.assertFalse(df1.equals(df3))	STRING
HIGH	python/pyspark/ml/tests/test_param.py	261	"inputCol: input column name. (undefined)",	CODE
HIGH	python/pyspark/errors/exceptions/captured.py	240	desc=e.toString(),	CODE
HIGH	python/pyspark/resource/requests.py	321	that the cluster manager doesn't support the result is undefined, it may error or may just	STRING
HIGH	python/pyspark/pandas/series.py	6759	if get_option("compute.eager_check") and not self.index.equals(other.index):	CODE
HIGH	python/pyspark/pandas/utils.py	996	return left._jc.equals(right._jc)	CODE
HIGH	python/pyspark/pandas/frame.py	1714	# | 2|[{0, null}, {1, n...|	COMMENT
HIGH	python/pyspark/pandas/indexing.py	556	cast(ClassicColumn, col)._jc.toString() for col in data_spark_columns	CODE
HIGH	python/pyspark/pandas/groupby.py	1305	Flag to ignore NA(nan/null) values during truth testing.	STRING
HIGH	python/pyspark/pandas/base.py	1464	# If even one StructField is null, that row should be dropped.	COMMENT
HIGH⚡	python/pyspark/pandas/tests/computation/test_combine.py	682	# Only update where new value > 150 (and old is null)	COMMENT
HIGH	python/pyspark/pandas/tests/window/test_rolling_adv.py	42	# pandas 3 returns 0.0 (not null); pandas < 3 returns nan. Both are matched here.	COMMENT
HIGH	…thon/pyspark/pandas/tests/window/test_expanding_adv.py	43	# pandas 3 returns 0.0 (not null); pandas < 3 returns nan. Both are matched here.	COMMENT
HIGH	…hon/pyspark/pandas/tests/diff_frames_ops/test_error.py	198	psidx1.equals(psidx2)	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	201	self.assert_eq(pidx.equals(pidx), psidx.equals(psidx))	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	205	pidx.equals(pd.Index(["a", "b", "c"])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	206	psidx.equals(ps.Index(["a", "b", "c"])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	209	pidx.equals(pd.Index(["b", "b", "a"])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	210	psidx.equals(ps.Index(["b", "b", "a"])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	221	self.assert_eq(pmidx.equals(pmidx), psmidx.equals(psmidx))	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	225	pmidx.equals(pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	226	psmidx.equals(ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	229	pmidx.equals(pd.MultiIndex.from_tuples([("c", "z"), ("b", "y"), ("a", "x")])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	230	psmidx.equals(ps.MultiIndex.from_tuples([("c", "z"), ("b", "y"), ("a", "x")])),	CODE
HIGH⚡	python/pyspark/pandas/tests/indexes/test_basic.py	234	self.assert_eq(pidx.equals(pmidx), psidx.equals(psmidx))	CODE
HIGH	python/pyspark/pandas/indexes/base.py	387	and self.equals(other)	CODE
HIGH⚡	python/pyspark/pandas/indexes/base.py	411	>>> idx.equals(idx)	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	414	... idx.equals(ps.Index(['a', 'b', 'c']))	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	417	... idx.equals(ps.Index(['b', 'b', 'a']))	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	419	>>> idx.equals(midx)	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	424	>>> midx.equals(midx)	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	427	... midx.equals(ps.MultiIndex.from_tuples([('a', 'x'), ('b', 'y'), ('c', 'z')]))	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	430	... midx.equals(ps.MultiIndex.from_tuples([('c', 'z'), ('b', 'y'), ('a', 'x')]))	STRING
HIGH⚡	python/pyspark/pandas/indexes/base.py	432	>>> midx.equals(idx)	STRING
HIGH⚡	python/pyspark/sql/conversion.py	180	if batch.schema.equals(arrow_schema, check_metadata=False):	CODE
HIGH	python/pyspark/sql/types.py	1845	return stringConcat.toString()	CODE
HIGH	python/pyspark/sql/types.py	262	null, UDTs, arrays, structs, and maps."""	STRING
HIGH	python/pyspark/sql/context.py	845	'{"field1" : null, "field2": "row3", "field3":{"field4":33, "field5": []}}',	CODE
HIGH	python/pyspark/sql/group.py	76	jvm_string = self._jgd.toString()	CODE
HIGH	python/pyspark/sql/tvf.py	427	Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.	STRING
HIGH	python/pyspark/sql/tvf.py	566	null, and any other variant values.	STRING
HIGH	python/pyspark/sql/tvf.py	631	SQL NULL, variant null, and any other variant values, then NULL is produced.	STRING
78 more matches not shown…

Unused Imports · 612 hits · 562 pts

Severity	File	Line	Context
LOW	python/run-tests.py	45	CODE
LOW	python/packaging/connect/pyspark_connect/__init__.py	22	CODE
LOW	python/pyspark/worker.py	52	CODE
LOW	python/pyspark/util.py	60	CODE
LOW	python/pyspark/util.py	62	CODE
LOW	python/pyspark/util.py	63	CODE
LOW	python/pyspark/util.py	65	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	66	CODE
LOW	python/pyspark/util.py	91	CODE
LOW	python/pyspark/util.py	91	CODE
LOW	python/pyspark/util.py	91	CODE
LOW	python/pyspark/util.py	91	CODE
LOW	python/pyspark/util.py	91	CODE
LOW	python/pyspark/util.py	98	CODE
LOW	python/pyspark/util.py	929	CODE
LOW	python/pyspark/util.py	949	CODE
LOW	python/pyspark/conf.py	27	CODE
LOW	python/pyspark/shell.py	33	CODE
LOW	python/pyspark/__init__.py	68	CODE
LOW	python/pyspark/__init__.py	69	CODE
LOW	python/pyspark/__init__.py	69	CODE
LOW	python/pyspark/__init__.py	70	CODE
LOW	python/pyspark/__init__.py	71	CODE
LOW	python/pyspark/__init__.py	71	CODE
LOW	python/pyspark/__init__.py	72	CODE
LOW	python/pyspark/__init__.py	72	CODE
LOW	python/pyspark/__init__.py	73	CODE
LOW	python/pyspark/__init__.py	73	CODE
LOW	python/pyspark/__init__.py	73	CODE
LOW	python/pyspark/__init__.py	74	CODE
LOW	python/pyspark/__init__.py	74	CODE
LOW	python/pyspark/__init__.py	75	CODE
LOW	python/pyspark/__init__.py	76	CODE
LOW	python/pyspark/__init__.py	131	CODE
LOW	python/pyspark/__init__.py	56	CODE
LOW	python/pyspark/__init__.py	56	CODE
LOW	python/pyspark/__init__.py	57	CODE
LOW	python/pyspark/__init__.py	58	CODE
552 more matches not shown…

Self-Referential Comments · 144 hits · 403 pts

Severity	File	Line	Snippet	Context
MEDIUM	bin/docker-image-tool.sh	80	# Create a smaller build context for docker in dev builds to make the build faster. Docker	COMMENT
MEDIUM	python/run-tests.py	268	# Create a unique temp directory under 'target/' for each run. The TMPDIR variable is	COMMENT
MEDIUM	python/run-tests.py	544	# Create the target directory before starting tasks to avoid races.	COMMENT
MEDIUM	python/pyspark/java_gateway.py	77	# Create a temporary directory where the gateway server should write the connection	COMMENT
MEDIUM	python/pyspark/statcounter.py	18	# This file is ported from spark/util/StatCounter.scala	COMMENT
MEDIUM	python/pyspark/daemon.py	116	# Create a new process group to corral our children	COMMENT
MEDIUM	python/pyspark/daemon.py	122	# Create a listening socket on the loopback interface	COMMENT
MEDIUM	python/pyspark/core/rdd.py	245	# This method is called when attempting to pickle an RDD, which is always an error:	COMMENT
MEDIUM	python/pyspark/core/rdd.py	2827	... # Create the conf for writing	STRING
MEDIUM	python/pyspark/core/rdd.py	2839	... # Create the conf for reading	STRING
MEDIUM	python/pyspark/core/rdd.py	2986	... # Create the conf for writing	STRING
MEDIUM	python/pyspark/core/rdd.py	2998	... # Create the conf for reading	STRING
MEDIUM	python/pyspark/core/context.py	298	# Create the Java SparkContext through Py4J	COMMENT
MEDIUM	python/pyspark/core/context.py	303	# Create a single Accumulator in Java that we'll send all our updates through;	COMMENT
MEDIUM	python/pyspark/core/context.py	381	# Create a temporary directory inside spark.local.dir:	COMMENT
MEDIUM	python/pyspark/core/context.py	491	# This method is called when attempting to pickle SparkContext, which is always an error:	STRING
MEDIUM	python/pyspark/core/context.py	1492	... # Create the conf for writing	CODE
MEDIUM	python/pyspark/core/context.py	1504	... # Create the conf for reading	CODE
MEDIUM	python/pyspark/core/context.py	1689	... # Create the conf for writing	CODE
MEDIUM	python/pyspark/core/context.py	1701	... # Create the conf for reading	CODE
MEDIUM	python/pyspark/mllib/tests/test_linalg.py	534	# Create a CSC matrix with non-sorted indices	COMMENT
MEDIUM	python/pyspark/mllib/tests/test_streaming_algorithms.py	101	# Create a toy dataset by setting a tiny offset for each point.	COMMENT
MEDIUM	python/pyspark/mllib/tests/test_streaming_algorithms.py	396	# Create a model with initial Weights equal to coefs	COMMENT
MEDIUM	python/pyspark/tests/test_rdd.py	736	# Create a DataFrame with many columns, call a Python function on each row, and take only	COMMENT
MEDIUM	python/pyspark/pipelines/init_cli.py	52	# Create the storage directory	COMMENT
MEDIUM⚡	python/pyspark/pipelines/init_cli.py	65	# Create the transformations directory	COMMENT
MEDIUM⚡	python/pyspark/pipelines/init_cli.py	69	# Create the Python example file	COMMENT
MEDIUM⚡	python/pyspark/pipelines/init_cli.py	74	# Create the SQL example file	COMMENT
MEDIUM	python/pyspark/pipelines/tests/test_cli.py	376	# Create a minimal pipeline spec	STRING
MEDIUM	python/pyspark/pipelines/tests/test_cli.py	400	# Create a minimal pipeline spec	STRING
MEDIUM	python/pyspark/pipelines/tests/test_cli.py	425	# Create a minimal pipeline spec	STRING
MEDIUM	python/pyspark/ml/pipeline.py	188	# Create a new instance of this stage.	COMMENT
MEDIUM	python/pyspark/ml/pipeline.py	346	# Create a new instance of this stage.	COMMENT
MEDIUM	python/pyspark/ml/tuning.py	981	# Create a new instance of this stage.	COMMENT
MEDIUM	python/pyspark/ml/tuning.py	1559	# Create a new instance of this stage.	COMMENT
MEDIUM	python/pyspark/ml/tuning.py	1684	# Create a new instance of this stage.	COMMENT
MEDIUM	python/pyspark/ml/tests/test_feature.py	361	# Create a DataFrame	COMMENT
MEDIUM	python/pyspark/pandas/tests/io/test_io.py	34	# This file contains test cases for 'Serialization / IO / Conversion'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_time_series.py	26	# This file contains test cases for 'Time series-related'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_spark.py	34	# This file contains test cases for 'Spark-related'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_attrs.py	26	# This file contains test cases for 'Attributes and underlying data'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_constructor.py	34	# This file contains test cases for 'Constructor'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_conversion.py	25	# This file contains test cases for 'Conversion'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_reindexing.py	31	# This file contains test cases for 'Reindexing / Selection / Label manipulation'	COMMENT
MEDIUM	python/pyspark/pandas/tests/frame/test_reshaping.py	27	# This file contains test cases for 'Reshaping, sorting, transposing'	COMMENT
MEDIUM	python/pyspark/pandas/tests/computation/test_combine.py	25	# This file contains test cases for 'Combining / joining / merging'	COMMENT
MEDIUM	…on/pyspark/pandas/tests/computation/test_apply_func.py	29	# This file contains test cases for 'Function application, GroupBy & Window'	COMMENT
MEDIUM	…/pyspark/pandas/tests/computation/test_missing_data.py	27	# This file contains test cases for 'Missing data handling'	COMMENT
MEDIUM	…on/pyspark/pandas/tests/computation/test_binary_ops.py	26	# This file contains test cases for 'Binary operator functions'	COMMENT
MEDIUM	python/pyspark/pandas/tests/computation/test_compute.py	26	# This file contains test cases for 'Computations / Descriptive Stats'	COMMENT
MEDIUM	…thon/pyspark/pandas/tests/indexes/test_indexing_adv.py	56	# Create the equivalent of pdf.loc[3] as a Koalas Series	COMMENT
MEDIUM	…thon/pyspark/pandas/tests/indexes/test_indexing_adv.py	142	# Create the equivalent of pdf.loc[3] as a Koalas Series	COMMENT
MEDIUM	python/pyspark/pandas/tests/indexes/test_indexing.py	26	# This file contains test cases for 'Indexing, Iteration'	COMMENT
MEDIUM	python/pyspark/pandas/indexes/base.py	263	# This method is used via `DataFrame.info` API internally.	COMMENT
MEDIUM	python/pyspark/sql/dataframe.py	563	... # Create a table with Rate source.	STRING
MEDIUM	python/pyspark/sql/dataframe.py	6831	>>> # Create a simple UDTF that processes table data	STRING
MEDIUM	python/pyspark/sql/dataframe.py	6837	>>> # Create a DataFrame	STRING
MEDIUM	python/pyspark/sql/session.py	624	# Create a new SparkSession in the JVM	COMMENT
MEDIUM	python/pyspark/sql/session.py	1647	# Create a DataFrame from pandas DataFrame.	COMMENT
MEDIUM	python/pyspark/sql/session.py	1652	# Create a DataFrame from PyArrow Table.	COMMENT
84 more matches not shown…

Decorative Section Separators · 103 hits · 376 pts

Severity	File	Line	Snippet	Context
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	662	# -------------------------------------------------	COMMENT
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	698	# ------------------------------------	COMMENT
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	704	# -----------------------------------	COMMENT
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	816	# -------------------------------	COMMENT
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	1125	# ------------------------------------	COMMENT
MEDIUM	python/pyspark/cloudpickle/cloudpickle.py	1207	# ---------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	92	# =========================================================================	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	94	# =========================================================================	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	96	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	287	# =========================================================================	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	289	# =========================================================================	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	291	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	359	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	361	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	400	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	402	# -------------------------------------------------------------------------	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	560	# =========================================================================	COMMENT
MEDIUM⚡	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	562	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	50	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	52	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	67	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	69	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	208	# -------------------------------------------------------------------------	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	210	# -------------------------------------------------------------------------	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	238	# -------------------------------------------------------------------------	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	240	# -------------------------------------------------------------------------	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	438	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	440	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	528	# =========================================================================	COMMENT
MEDIUM	…/upstream/pyarrow/test_pyarrow_array_type_inference.py	530	# =========================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	189	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	191	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	196	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	198	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	206	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	208	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	216	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	218	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	228	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	230	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	240	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	242	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	258	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	260	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	268	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	270	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	286	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	288	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	292	# =====================================================================	COMMENT
MEDIUM⚡	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	294	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	150	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	152	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	176	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	178	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	367	# =====================================================================	COMMENT
MEDIUM	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	369	# =====================================================================	COMMENT
MEDIUM⚡	…k/tests/upstream/pyarrow/test_pyarrow_type_coercion.py	56	# =========================================================================	COMMENT
MEDIUM⚡	…k/tests/upstream/pyarrow/test_pyarrow_type_coercion.py	58	# =========================================================================	COMMENT
MEDIUM⚡	…k/tests/upstream/pyarrow/test_pyarrow_type_coercion.py	85	# =========================================================================	COMMENT
MEDIUM⚡	…k/tests/upstream/pyarrow/test_pyarrow_type_coercion.py	87	# =========================================================================	COMMENT
43 more matches not shown…

Deep Nesting · 382 hits · 322 pts

Severity	File	Line	Context
LOW	python/run-tests.py	236	CODE
LOW	python/run-tests.py	474	CODE
LOW	python/pyspark/worker.py	678	CODE
LOW	python/pyspark/worker.py	710	CODE
LOW	python/pyspark/worker.py	754	CODE
LOW	python/pyspark/worker.py	858	CODE
LOW	python/pyspark/worker.py	1991	CODE
LOW	python/pyspark/worker.py	3651	CODE
LOW	python/pyspark/worker.py	1387	CODE
LOW	python/pyspark/worker.py	963	CODE
LOW	python/pyspark/worker.py	1125	CODE
LOW	python/pyspark/worker.py	1401	CODE
LOW	python/pyspark/worker.py	1481	CODE
LOW	python/pyspark/worker.py	1499	CODE
LOW	python/pyspark/worker.py	2409	CODE
LOW	python/pyspark/worker.py	2476	CODE
LOW	python/pyspark/worker.py	3719	CODE
LOW	python/pyspark/worker.py	1548	CODE
LOW	python/pyspark/worker.py	1661	CODE
LOW	python/pyspark/worker.py	3737	CODE
LOW	python/pyspark/worker.py	1409	CODE
LOW	python/pyspark/worker.py	1427	CODE
LOW	python/pyspark/worker.py	1437	CODE
LOW	python/pyspark/worker.py	1459	CODE
LOW	python/pyspark/worker_message.py	135	CODE
LOW	python/pyspark/util.py	578	CODE
LOW	python/pyspark/conf.py	180	CODE
LOW	python/pyspark/shuffle.py	62	CODE
LOW	python/pyspark/shuffle.py	779	CODE
LOW	python/pyspark/statcounter.py	60	CODE
LOW	python/pyspark/accumulators.py	263	CODE
LOW	python/pyspark/accumulators.py	268	CODE
LOW	python/pyspark/profiler.py	189	CODE
LOW	python/pyspark/daemon.py	46	CODE
LOW	python/pyspark/daemon.py	115	CODE
LOW	python/pyspark/core/rdd.py	2210	CODE
LOW	python/pyspark/core/rdd.py	3672	CODE
LOW	python/pyspark/core/rdd.py	3724	CODE
LOW	python/pyspark/core/context.py	225	CODE
LOW	python/pyspark/core/context.py	1816	CODE
LOW	python/pyspark/logger/worker_io.py	214	CODE
LOW	python/pyspark/cloudpickle/cloudpickle.py	313	CODE
LOW	python/pyspark/cloudpickle/cloudpickle.py	338	CODE
LOW	python/pyspark/cloudpickle/cloudpickle.py	1069	CODE
LOW	python/pyspark/cloudpickle/cloudpickle.py	1441	CODE
LOW	python/pyspark/mllib/classification.py	236	CODE
LOW	python/pyspark/mllib/common.py	75	CODE
LOW	python/pyspark/mllib/common.py	96	CODE
LOW	python/pyspark/mllib/common.py	160	CODE
LOW	python/pyspark/mllib/linalg/__init__.py	96	CODE
LOW	python/pyspark/mllib/linalg/__init__.py	114	CODE
LOW	python/pyspark/mllib/linalg/__init__.py	415	CODE
LOW	python/pyspark/mllib/linalg/__init__.py	824	CODE
LOW	python/pyspark/tests/test_serializers.py	213	CODE
LOW	python/pyspark/tests/test_worker.py	38	CODE
LOW	python/pyspark/tests/test_shuffle.py	66	CODE
LOW	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	84	CODE
LOW	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	137	CODE
LOW	python/pyspark/pipelines/cli.py	121	CODE
LOW	python/pyspark/pipelines/cli.py	221	CODE
322 more matches not shown…

Excessive Try-Catch Wrapping · 206 hits · 230 pts

Severity	File	Line	Snippet	Context
LOW	python/conf_vscode/sitecustomize.py	38	except Exception:	CODE
MEDIUM	python/pyspark/worker.py	1599	def mapper(_, it):	CODE
MEDIUM	python/pyspark/worker.py	1930	def evaluate(*a) -> tuple:	CODE
MEDIUM	python/pyspark/worker.py	3737	def _reader_thread():	CODE
LOW	python/pyspark/worker.py	1363	except Exception as e:	CODE
LOW	python/pyspark/worker.py	1542	except Exception as e:	CODE
LOW	python/pyspark/worker.py	1713	except Exception as e:	CODE
LOW	python/pyspark/worker.py	1728	except Exception as e:	CODE
LOW	python/pyspark/worker.py	1839	except Exception as e:	CODE
LOW	python/pyspark/worker.py	1935	except Exception as e:	CODE
LOW	python/pyspark/worker.py	3755	except Exception as e:	CODE
LOW	python/pyspark/worker.py	3792	except Exception:	CODE
LOW	python/pyspark/worker.py	3845	except Exception:	CODE
LOW⚡	python/pyspark/threaddump.py	45	except Exception as e:	CODE
MEDIUM⚡	python/pyspark/threaddump.py	46	print(f"Error getting children of process {args.pid}: {e}")	CODE
LOW⚡	python/pyspark/threaddump.py	54	except Exception:	CODE
MEDIUM	python/pyspark/threaddump.py	28	def main() -> int:	CODE
LOW	python/pyspark/util.py	788	except Exception:	CODE
LOW	python/pyspark/serializers.py	446	except Exception as e:	CODE
LOW	python/pyspark/serializers.py	495	except Exception:	CODE
LOW	python/pyspark/shell.py	73	except Exception:	CODE
LOW	python/pyspark/shell.py	90	except Exception:	CODE
LOW	python/pyspark/memory_profiler_ext.py	32	except Exception:	CODE
LOW	python/pyspark/memory_profiler_ext.py	69	except Exception:	CODE
LOW	python/pyspark/install.py	164	except Exception:	CODE
LOW	python/pyspark/install.py	214	except Exception:	CODE
LOW	python/pyspark/install.py	246	except Exception as e:	CODE
LOW	python/pyspark/instrumentation_utils.py	48	except Exception as ex:	CODE
LOW	python/pyspark/instrumentation_utils.py	72	except Exception as ex:	CODE
LOW	python/pyspark/daemon.py	96	except Exception:	CODE
LOW	python/pyspark/daemon.py	272	except Exception:	CODE
LOW	python/pyspark/core/context.py	374	except Exception:	CODE
LOW	python/pyspark/core/broadcast.py	181	except Exception as e:	CODE
LOW	python/pyspark/logger/worker_io.py	252	except Exception:	CODE
LOW	python/pyspark/cloudpickle/cloudpickle.py	232	except Exception:	CODE
LOW	python/pyspark/tests/test_rdd.py	354	except Exception:	CODE
LOW	python/pyspark/tests/test_rdd.py	919	except Exception:	CODE
LOW	python/pyspark/tests/test_taskcontext.py	206	except Exception:	CODE
LOW	python/pyspark/tests/test_taskcontext.py	277	except Exception:	CODE
MEDIUM	python/pyspark/tests/test_taskcontext.py	203	def f(iterator):	CODE
LOW	python/pyspark/tests/test_util.py	182	except Exception as e:	CODE
LOW	python/pyspark/tests/test_pin_thread.py	68	except Exception as e:	CODE
LOW	python/pyspark/tests/test_pin_thread.py	123	except Exception:	CODE
LOW	python/pyspark/tests/test_worker.py	55	except Exception:	CODE
LOW⚡	python/pyspark/tests/test_worker.py	155	except Exception:	CODE
MEDIUM	python/pyspark/tests/test_worker.py	52	def run():	CODE
MEDIUM	python/pyspark/tests/test_worker.py	152	def count():	CODE
LOW	python/pyspark/tests/test_context.py	237	except Exception:	CODE
LOW	python/pyspark/tests/test_install_spark.py	53	except Exception:	CODE
LOW	…stream/pyarrow/test_pyarrow_arrow_to_pandas_default.py	399	except Exception as e:	CODE
LOW⚡	…park/tests/upstream/pyarrow/test_pyarrow_array_cast.py	134	except Exception as e:	CODE
LOW	python/pyspark/testing/sqlutils.py	110	except Exception:	CODE
LOW	python/pyspark/testing/sqlutils.py	168	except Exception as e:	CODE
LOW	python/pyspark/testing/goldenutils.py	189	except Exception as e:	CODE
MEDIUM	python/pyspark/testing/utils.py	368	def _terminal_color_support():	CODE
LOW	python/pyspark/testing/utils.py	128	except Exception as e:	CODE
LOW	python/pyspark/testing/utils.py	140	except Exception as e:	CODE
LOW	python/pyspark/testing/utils.py	373	except Exception:	CODE
LOW	python/pyspark/ml/functions.py	848	except Exception as e:	CODE
LOW	python/pyspark/ml/wrapper.py	66	except Exception:	CODE
146 more matches not shown…

AI Slop Vocabulary · 54 hits · 136 pts

Severity	File	Line	Snippet	Context
MEDIUM	…/org/apache/spark/launcher/AbstractCommandBuilder.java	244	// Place slf4j-api-* jar first to be robust	COMMENT
MEDIUM	…c/test/scala/org/apache/spark/ui/UISeleniumSuite.scala	414	// Essentially, we want to check that none of the stage rows show	COMMENT
MEDIUM	…c/test/scala/org/apache/spark/ui/UISeleniumSuite.scala	470	// Essentially, we want to check that none of the stage rows show	COMMENT
MEDIUM	…ala/org/apache/spark/scheduler/DAGSchedulerSuite.scala	2376	// For a robust test assertion, limit number of job tasks to 1; that is,	COMMENT
MEDIUM	…apache/spark/scheduler/SchedulerIntegrationSuite.scala	431	// it really can only be "best-effort" in any case, and the scheduler should be robust to that.	STRING
MEDIUM	…rg/apache/spark/scheduler/TaskSchedulerImplSuite.scala	492	// Even though we launched a local task above, we still utilize non-local exec2.	COMMENT
MEDIUM	…he/spark/scheduler/HealthTrackerIntegrationSuite.scala	80	// robust to one bad node.	COMMENT
MEDIUM	…main/scala/org/apache/spark/storage/BlockManager.scala	1224	// BlockTransferService, which will leverage it to spill the block; if not, then passed-in	COMMENT
MEDIUM	python/packaging/classic/setup.py	171	# TODO(SPARK-32837) leverage pip's custom options	COMMENT
LOW	…spark/messages/socket/spark_socket_message_receiver.py	49	# For socket communication, we just pass along the underlying socket	COMMENT
LOW	python/pyspark/tests/test_install_spark.py	60	# we just use a hard-coded version.	COMMENT
LOW	python/pyspark/ml/tests/test_functions.py	208	# just return the batch size as the "prediction"	STRING
MEDIUM	python/pyspark/errors/utils.py	378	# Excluding Python magic methods that do not utilize JVM functions.	COMMENT
LOW	python/pyspark/pandas/resample.py	359	# here just use Pandas' resample on a 1-length series to get it.	COMMENT
LOW	python/pyspark/pandas/generic.py	3101	# If Series has only a single value, just return it as a scalar.	STRING
MEDIUM	python/pyspark/pandas/series.py	6093	# If `where` has duplicate items, leverage the pandas directly	COMMENT
LOW	python/pyspark/pandas/utils.py	794	# '+' is meaningless for writing methods, but pandas just pass it as 'w'.	COMMENT
LOW	python/pyspark/pandas/utils.py	798	# '+' is meaningless for writing methods, but pandas just pass it as 'a'.	COMMENT
LOW	python/pyspark/pandas/frame.py	10233	# In this case, we can simply use `summary` to calculate the stats.	COMMENT
MEDIUM	python/pyspark/pandas/tests/groupby/test_stat.py	30	# TODO: All statistical functions should leverage this utility	COMMENT
MEDIUM	python/pyspark/sql/session.py	601	# used in conjunction with Spark Connect mode.	COMMENT
MEDIUM⚡	python/pyspark/sql/tests/test_functions.py	3091	"""Test tuple_sketch_agg + operations + estimate comprehensive test - double"""	STRING
MEDIUM	python/pyspark/sql/tests/test_functions.py	3146	"""Test tuple_sketch_agg + operations + estimate comprehensive test - integer"""	STRING
MEDIUM	…ing/test_pandas_transform_with_state_state_variable.py	354	# TODO SPARK-50908 holistic fix for TTL suite	COMMENT
MEDIUM	python/pyspark/sql/connect/client/core.py	454	# Rewrite the URL to use http as the scheme so that we can leverage	COMMENT
LOW	R/pkg/R/sparkR.R	664	#' To remove/unset property simply set `value` to NULL e.g. setLocalProperty("key", NULL)	COMMENT
MEDIUM	R/pkg/R/column.R	296	#' Can be used in conjunction with \code{when} to specify a default value for expressions.	COMMENT
MEDIUM	…apache/spark/streaming/ReceivedBlockTrackerSuite.scala	320	// deletion more robust rather than a parallelized operation where we fire and forget	COMMENT
MEDIUM	…cala/org/apache/spark/streaming/ui/StreamingPage.scala	163	// We leverage timeFormat as the value would be same as timeFormat. This means it is	COMMENT
MEDIUM	…rg/apache/spark/network/crypto/CtrTransportCipher.java	229	// to utilize two helper ByteArrayWritableChannel for streaming. One is used to receive raw data	COMMENT
MEDIUM	…network/shuffle/streaming/StreamingShuffleMessage.java	68	// Essentially, other message types from reader to writer won't have a valid sequence number.	COMMENT
MEDIUM	…scala/org/apache/spark/examples/mllib/LDAExample.scala	139	// add (1.0 / actualCorpusSize) to MiniBatchFraction be more robust on tiny datasets.	COMMENT
MEDIUM	…a/org/apache/spark/sql/StatisticsCollectionSuite.scala	934	// We can't leverage LogicalRDD.fromDataset here, since it triggers physical planning and	COMMENT
MEDIUM	…c/test/scala/org/apache/spark/sql/DataFrameSuite.scala	1637	// We can't leverage LogicalRDD.fromDataset here, since it triggers physical planning and	COMMENT
MEDIUM	…apache/spark/sql/streaming/FileStreamSourceSuite.scala	2342	// file stream source will not leverage unread files - next batch will also trigger	COMMENT
MEDIUM	…org/apache/spark/sql/execution/UnionCodegenSuite.scala	535	// Explicit cap so the assertion is robust to future default changes.	STRING
MEDIUM	…l/execution/datasources/PushVariantIntoScanSuite.scala	476	// Project/Filter nodes wrap it. This keeps scan-content assertions robust against optimizer	COMMENT
MEDIUM	…ion/datasources/v2/state/StateDataSourceTestBase.scala	103	// check with more data - leverage full partitions	COMMENT
MEDIUM	…park/sql/catalyst/analysis/ResolveSessionCatalog.scala	227	// resolution was skipped) so the rewrite stays robust across analyzer ordering changes.	COMMENT
MEDIUM	…in/scala/org/apache/spark/sql/jdbc/OracleDialect.scala	166	// Not sure if there is a more robust way to identify the field as a float (or other	COMMENT
MEDIUM	…icpruning/RowLevelOperationRuntimeGroupFiltering.scala	78	// in order to leverage a regular batch scan in the group filter query	COMMENT
MEDIUM	…on/python/streaming/ApplyInPandasWithStateWriter.scala	107	// from the entire data part of Arrow RecordBatch. We leverage the state metadata to also	COMMENT
MEDIUM	…ors/stateful/join/StreamingSymmetricHashJoinExec.scala	1098	// to let users leverage both sides of event time column for output of join, so the watermark	COMMENT
MEDIUM	…/execution/streaming/runtime/FileStreamSourceLog.scala	130	// be started. We leverage the fact to skip calculation if possible.	COMMENT
MEDIUM	…sql/execution/streaming/runtime/ProgressReporter.scala	572	// by itself, so leverage it.	COMMENT
MEDIUM	…ark/sql/catalyst/expressions/CodeGenerationSuite.scala	603	| // to make the test more robust, in case the compiler can eliminate the else branch.	STRING
MEDIUM	…e/spark/sql/catalyst/analysis/RelationResolution.scala	400	// To utilize this code path to execute V1 commands, e.g. INSERT,	COMMENT
MEDIUM	…ql/catalyst/expressions/SubExprEvaluationRuntime.scala	100	// We leverage `IdentityHashMap` so we compare expression keys by reference here.	COMMENT
MEDIUM	…k/sql/catalyst/expressions/codegen/CodeFormatter.scala	119	// examines the number of parenthesis and braces in that line. This isn't the most robust	COMMENT
MEDIUM	…/spark/sql/hive/execution/HiveCompatibilitySuite.scala	287	// The isolated classloader seemed to make some of our test reset mechanisms less robust.	COMMENT
MEDIUM	…n/scala/org/apache/spark/sql/hive/HiveInspectors.scala	1057	// TODO: hard-coding a list here is not very robust. A better idea is to have some kind of query	COMMENT
MEDIUM	…/main/java/org/apache/spark/sql/streaming/Trigger.java	97	* @deprecated This is deprecated as of Spark 3.4.0. Use {@link #AvailableNow()} to leverage	COMMENT
MEDIUM	…k/sql/connect/pipelines/PipelineEventSenderSuite.scala	284	// total is logged at shutdown. The assertions match on substrings so they stay robust to	COMMENT
MEDIUM	…e/spark/sql/hive/thriftserver/SharedThriftServer.scala	134	// It's much more robust than set a random port generated by ourselves ahead	COMMENT

Structural Annotation Overuse · 68 hits · 136 pts

Severity	File	Line	Snippet	Context
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1058	// Step 1: Write an in-progress log containing only ApplicationStart (no job).	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1065	// Step 2: Load the app UI; this builds the disk store from the in-progress snapshot.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1071	// Step 3: Simulate ApplicationCache LRU eviction BEFORE the app completes.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1081	// Step 4: Complete the app. Write a new log file (without .inprogress suffix) that	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1091	// Step 5: checkForLogs() detects the completed log.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1096	// Step 6: Load the UI again.	COMMENT
LOW⚡	python/pyspark/sql/conversion.py	185	# Step 1: pick source columns from batch to align with target schema	COMMENT
LOW	python/pyspark/sql/conversion.py	214	# Step 2: check types / cast, collect all mismatches	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	109	# Step 1: Write initial data and run streaming query	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	116	# Step 2: Repartition to more partitions	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	121	# Step 3: Add more data and restart query	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	129	# Step 4: Repartition to fewer partitions	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	134	# Step 5: Add more data and restart query	COMMENT
LOW⚡	R/pkg/inst/worker/worker.R	247	# Step 1: hash the data to an environment	COMMENT
LOW	R/pkg/inst/worker/worker.R	264	# Step 2: write out all of the environment as key-value pairs.	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1739	// Step 1: Run the stateful query to create the full checkpoint structure	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1742	// Step 2: Delete the state directory	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1748	// Step 3: Attempt to read state - expected to fail since state is deleted	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1754	// Step 4: Verify the state directory was NOT recreated by the reader	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1866	// Step 1: Write data with correct schema and commit	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1877	// Step 2: Reopen with a wrong valueSchema (StringType instead of IntegerType)	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1906	// Step 1: Write data with correct schema and commit	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1918	// Step 2: Reopen with a wrong valueSchema (StringType instead of IntegerType)	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	228	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	270	// Step 1: Create state by running a composite key streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	304	// Step 1: Create state by running stream-stream join	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	316	// Step 2: Test all 4 state stores created by stream-stream join	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	343	// Step 1: Create state by running flatMapGroupsWithState	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	813	// Step 1: Create state by running dropDuplicatesWithinWatermark	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	838	// Step 1: Create state by running dropDuplicates with column	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	863	// Step 1: Create state by running session window aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	892	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	965	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	128	// Step 1: Run initial query to create state	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	131	// Step 2: Read state data before repartition	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	150	// Step 3: Run repartition	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	157	// Step 4: Verify offset and commit logs	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	162	// Step 5: Validate state for each store and column family after repartition	COMMENT
LOW	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	190	// Step 6: Resume query with new input and verify	COMMENT
LOW	…scala/org/apache/spark/sql/avro/AvroOutputWriter.scala	38	// NOTE: This class is instantiated and used on executor side only, no need to be serializable.	COMMENT
LOW⚡	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	140	// Step 1: Apply variable substitution to expand any variable references.	COMMENT
LOW⚡	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	143	// Step 2: Apply parameter substitution if a parameter context is provided.	COMMENT
LOW	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	168	// Step 3: Set up the origin with SQL text and position mapper to enable	COMMENT
LOW	…/parquet/ParquetOutputWriterWithVariantShredding.scala	39	// NOTE: This class is instantiated and used on executor side only, no need to be serializable.	COMMENT
LOW	…xecution/datasources/parquet/ParquetOutputWriter.scala	27	// NOTE: This class is instantiated and used on executor side only, no need to be serializable.	COMMENT
LOW⚡	…/execution/aggregate/TungstenAggregationIterator.scala	264	// Step 5: Get the sorted iterator from the externalSorter.	COMMENT
LOW⚡	…/execution/aggregate/TungstenAggregationIterator.scala	267	// Step 6: Pre-load the first key-value pair from the sorted iterator to make	COMMENT
LOW	…/execution/aggregate/TungstenAggregationIterator.scala	280	// Step 7: set sortBased to true.	COMMENT
LOW⚡	…t/analysis/SequentialStreamingUnionAnalysisSuite.scala	227	// Step 1: Flatten the nested unions	COMMENT
LOW⚡	…t/analysis/SequentialStreamingUnionAnalysisSuite.scala	236	// Step 2: Validate the flattened plan	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	762	// Step 1: subquery1 (cp) and subquery2 (np) merge:	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	769	// Step 2: subquery3 (np) merges with merged(1,2) (cp). The cp Filter is tagged, so only a	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	818	// Step 1: subquery1 (cp) and subquery2 (np) merge as usual:	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	824	// Step 2: subquery3 (np, condition a > 1) merges with merged(1,2) (cp). The cp Filter is	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	182	// Step 0 - check initial state	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	187	// Step 1 - new active session in script 2	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	195	// Step 3 - close session 1, no more default session in both scripts	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	199	// Step 4 - no default session, same active session.	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	204	// Step 5 - clear active session in script 1	COMMENT
LOW⚡	…a/org/apache/spark/sql/connect/SparkSessionSuite.scala	208	// Step 6 - no default/no active session in script 1, script2 unchanged.	COMMENT
8 more matches not shown…

AI Structural Patterns · 129 hits · 122 pts

Severity	File	Line	Context
LOW	python/pyspark/core/context.py	171	CODE
LOW	python/pyspark/mllib/regression.py	292	CODE
LOW	python/pyspark/mllib/regression.py	476	CODE
LOW	python/pyspark/mllib/regression.py	656	CODE
LOW	python/pyspark/mllib/classification.py	331	CODE
LOW	python/pyspark/mllib/classification.py	423	CODE
LOW	python/pyspark/mllib/classification.py	646	CODE
LOW	python/pyspark/pipelines/api.py	113	CODE
LOW	python/pyspark/pipelines/api.py	126	CODE
LOW	python/pyspark/pipelines/api.py	217	CODE
LOW	python/pyspark/pipelines/api.py	230	CODE
LOW	python/pyspark/streaming/kinesis.py	69	CODE
LOW	python/pyspark/streaming/kinesis.py	88	CODE
LOW	python/pyspark/testing/utils.py	692	CODE
LOW	python/pyspark/ml/regression.py	303	CODE
LOW	python/pyspark/ml/regression.py	337	CODE
LOW	python/pyspark/ml/regression.py	1112	CODE
LOW	python/pyspark/ml/regression.py	1148	CODE
LOW	python/pyspark/ml/regression.py	1412	CODE
LOW	python/pyspark/ml/regression.py	1452	CODE
LOW	python/pyspark/ml/regression.py	1762	CODE
LOW	python/pyspark/ml/regression.py	1804	CODE
LOW	python/pyspark/ml/regression.py	2163	CODE
LOW	python/pyspark/ml/regression.py	2203	CODE
LOW	python/pyspark/ml/regression.py	2568	CODE
LOW	python/pyspark/ml/regression.py	2604	CODE
LOW	python/pyspark/ml/regression.py	3138	CODE
LOW	python/pyspark/ml/regression.py	3169	CODE
LOW	python/pyspark/ml/clustering.py	408	CODE
LOW	python/pyspark/ml/clustering.py	438	CODE
LOW	python/pyspark/ml/clustering.py	783	CODE
LOW	python/pyspark/ml/clustering.py	815	CODE
LOW	python/pyspark/ml/clustering.py	1141	CODE
LOW	python/pyspark/ml/clustering.py	1167	CODE
LOW	python/pyspark/ml/clustering.py	1699	CODE
LOW	python/pyspark/ml/clustering.py	1737	CODE
LOW	python/pyspark/ml/classification.py	723	CODE
LOW	python/pyspark/ml/classification.py	755	CODE
LOW	python/pyspark/ml/classification.py	1241	CODE
LOW	python/pyspark/ml/classification.py	1267	CODE
LOW	python/pyspark/ml/classification.py	1293	CODE
LOW	python/pyspark/ml/classification.py	1338	CODE
LOW	python/pyspark/ml/classification.py	1364	CODE
LOW	python/pyspark/ml/classification.py	1391	CODE
LOW	python/pyspark/ml/classification.py	1781	CODE
LOW	python/pyspark/ml/classification.py	1819	CODE
LOW	python/pyspark/ml/classification.py	2083	CODE
LOW	python/pyspark/ml/classification.py	2126	CODE
LOW	python/pyspark/ml/classification.py	2561	CODE
LOW	python/pyspark/ml/classification.py	2605	CODE
LOW	python/pyspark/ml/classification.py	2965	CODE
LOW	python/pyspark/ml/classification.py	2992	CODE
LOW	python/pyspark/ml/classification.py	3215	CODE
LOW	python/pyspark/ml/classification.py	3247	CODE
LOW	python/pyspark/ml/classification.py	4110	CODE
LOW	python/pyspark/ml/classification.py	4147	CODE
LOW	python/pyspark/ml/evaluation.py	579	CODE
LOW	python/pyspark/ml/evaluation.py	690	CODE
LOW	python/pyspark/ml/feature.py	3625	CODE
LOW	python/pyspark/ml/feature.py	3673	CODE
69 more matches not shown…

Verbosity Indicators · 62 hits · 118 pts

Severity	File	Line	Snippet	Context
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1058	// Step 1: Write an in-progress log containing only ApplicationStart (no job).	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1065	// Step 2: Load the app UI; this builds the disk store from the in-progress snapshot.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1071	// Step 3: Simulate ApplicationCache LRU eviction BEFORE the app completes.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1081	// Step 4: Complete the app. Write a new log file (without .inprogress suffix) that	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1091	// Step 5: checkForLogs() detects the completed log.	COMMENT
LOW⚡	…ache/spark/deploy/history/FsHistoryProviderSuite.scala	1096	// Step 6: Load the UI again.	COMMENT
LOW⚡	…/util/collection/unsafe/sort/UnsafeExternalSorter.java	474	// Step 1:	COMMENT
LOW⚡	…/util/collection/unsafe/sort/UnsafeExternalSorter.java	477	// Step 2:	COMMENT
LOW⚡	…/util/collection/unsafe/sort/UnsafeExternalSorter.java	480	// Step 3:	COMMENT
LOW	…/scala/org/apache/spark/storage/BlockInfoManager.scala	433	// reader counts. We need to check if the readLocksByTask per tasks are present, if they	COMMENT
LOW⚡	python/pyspark/sql/conversion.py	185	# Step 1: pick source columns from batch to align with target schema	COMMENT
LOW	python/pyspark/sql/conversion.py	214	# Step 2: check types / cast, collect all mismatches	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	109	# Step 1: Write initial data and run streaming query	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	116	# Step 2: Repartition to more partitions	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	121	# Step 3: Add more data and restart query	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	129	# Step 4: Repartition to fewer partitions	COMMENT
LOW⚡	…/streaming/test_streaming_offline_state_repartition.py	134	# Step 5: Add more data and restart query	COMMENT
LOW	…rk/sql/streaming/transform_with_state_driver_worker.py	72	# and the following code block should be only run once for each query run	COMMENT
LOW	R/pkg/inst/worker/daemon.R	98	# Forking succeeded and we need to check if they finished their jobs every second.	COMMENT
LOW⚡	R/pkg/inst/worker/worker.R	247	# Step 1: hash the data to an environment	COMMENT
LOW	R/pkg/inst/worker/worker.R	264	# Step 2: write out all of the environment as key-value pairs.	COMMENT
LOW	…ming/FlatMapGroupsWithStateWithInitialStateSuite.scala	57	// We need to check if not explicitly calling update will still save the init state or not	COMMENT
LOW	…ming/FlatMapGroupsWithStateWithInitialStateSuite.scala	124	// We need to check if not explicitly calling update will still save the state or not	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1739	// Step 1: Run the stateful query to create the full checkpoint structure	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1742	// Step 2: Delete the state directory	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1748	// Step 3: Attempt to read state - expected to fail since state is deleted	COMMENT
LOW⚡	…on/datasources/v2/state/StateDataSourceReadSuite.scala	1754	// Step 4: Verify the state directory was NOT recreated by the reader	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1866	// Step 1: Write data with correct schema and commit	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1877	// Step 2: Reopen with a wrong valueSchema (StringType instead of IntegerType)	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1906	// Step 1: Write data with correct schema and commit	COMMENT
LOW	…execution/streaming/state/RocksDBStateStoreSuite.scala	1918	// Step 2: Reopen with a wrong valueSchema (StringType instead of IntegerType)	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	228	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	270	// Step 1: Create state by running a composite key streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	304	// Step 1: Create state by running stream-stream join	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	316	// Step 2: Test all 4 state stores created by stream-stream join	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	343	// Step 1: Create state by running flatMapGroupsWithState	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	813	// Step 1: Create state by running dropDuplicatesWithinWatermark	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	838	// Step 1: Create state by running dropDuplicates with column	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	863	// Step 1: Create state by running session window aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	892	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW	…state/StatePartitionAllColumnFamiliesWriterSuite.scala	965	// Step 1: Create state by running a streaming aggregation	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	128	// Step 1: Run initial query to create state	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	131	// Step 2: Read state data before repartition	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	150	// Step 3: Run repartition	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	157	// Step 4: Verify offset and commit logs	COMMENT
LOW⚡	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	162	// Step 5: Validate state for each store and column family after repartition	COMMENT
LOW	…ng/state/OfflineStateRepartitionIntegrationSuite.scala	190	// Step 6: Resume query with new input and verify	COMMENT
LOW	…g/apache/spark/sql/classic/StreamingQueryManager.scala	321	// The following code block checks if a stream with the same name or id is running. Then it	COMMENT
LOW⚡	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	140	// Step 1: Apply variable substitution to expand any variable references.	COMMENT
LOW⚡	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	143	// Step 2: Apply parameter substitution if a parameter context is provided.	COMMENT
LOW	…la/org/apache/spark/sql/execution/SparkSqlParser.scala	168	// Step 3: Set up the origin with SQL text and position mapper to enable	COMMENT
LOW	…ql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala	133	// Also, we need to check if join is done on 2 tables from 2 different databases within same	COMMENT
LOW	…xecution/datasources/parquet/ParquetRowConverter.scala	835	// in case of schema evolution), we need to check if the repeated type matches one of the	STRING
LOW⚡	…/execution/aggregate/TungstenAggregationIterator.scala	264	// Step 5: Get the sorted iterator from the externalSorter.	COMMENT
LOW⚡	…/execution/aggregate/TungstenAggregationIterator.scala	267	// Step 6: Pre-load the first key-value pair from the sorted iterator to make	COMMENT
LOW	…/execution/aggregate/TungstenAggregationIterator.scala	280	// Step 7: set sortBased to true.	COMMENT
LOW⚡	…t/analysis/SequentialStreamingUnionAnalysisSuite.scala	227	// Step 1: Flatten the nested unions	COMMENT
LOW⚡	…t/analysis/SequentialStreamingUnionAnalysisSuite.scala	236	// Step 2: Validate the flattened plan	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	762	// Step 1: subquery1 (cp) and subquery2 (np) merge:	COMMENT
LOW⚡	…/spark/sql/catalyst/optimizer/MergeSubplansSuite.scala	769	// Step 2: subquery3 (np) merges with merged(1,2) (cp). The cp Filter is tagged, so only a	COMMENT
2 more matches not shown…

Modern Structural Boilerplate 118 hits · 117 pts

Severity	File	Line	Snippet	Context
LOW	python/packaging/connect/pyspark_connect/__init__.py	24	__all__ = [	CODE
LOW	python/pyspark/taskcontext.py	164	def _setTaskContext(cls: Type["TaskContext"], taskContext: Optional["TaskContext"]) -> None:	CODE
LOW	python/pyspark/conf.py	18	__all__ = ["SparkConf"]	CODE
LOW	python/pyspark/serializers.py	72	__all__ = [	CODE
LOW	python/pyspark/__init__.py	133	__all__ = [	CODE
LOW	python/pyspark/memory_profiler_ext.py	36	__all__ = [	CODE
LOW	python/pyspark/storagelevel.py	18	__all__ = ["StorageLevel"]	CODE
LOW	python/pyspark/accumulators.py	35	__all__ = ["Accumulator", "AccumulatorParam"]	CODE
LOW	python/pyspark/resultiterable.py	24	__all__ = ["ResultIterable"]	CODE
LOW	python/pyspark/messages/__init__.py	22	__all__ = [	CODE
LOW	python/pyspark/core/files.py	20	__all__ = ["SparkFiles"]	CODE
LOW	python/pyspark/core/rdd.py	119	__all__ = ["RDD"]	CODE
LOW	python/pyspark/core/context.py	81	__all__ = ["SparkContext"]	CODE
LOW	python/pyspark/core/broadcast.py	48	__all__ = ["Broadcast"]	CODE
LOW	python/pyspark/core/status.py	18	__all__ = ["SparkJobInfo", "SparkStageInfo", "SparkExecutorInfo", "StatusTracker"]	CODE
LOW	python/pyspark/logger/__init__.py	24	__all__ = ["PySparkLogger", "SPARK_LOG_SCHEMA"]	CODE
LOW	python/pyspark/cloudpickle/__init__.py	8	__all__ = [ # noqa	CODE
LOW	python/pyspark/mllib/tree.py	33	__all__ = [	CODE
LOW	python/pyspark/mllib/regression.py	51	__all__ = [	CODE
LOW	python/pyspark/mllib/clustering.py	41	__all__ = [	CODE
LOW	python/pyspark/mllib/classification.py	42	__all__ = [	CODE
LOW	python/pyspark/mllib/evaluation.py	28	__all__ = [	CODE
LOW	python/pyspark/mllib/__init__.py	33	__all__ = [	CODE
LOW	python/pyspark/mllib/feature.py	41	__all__ = [	CODE
LOW	python/pyspark/mllib/random.py	33	__all__ = [	CODE
LOW	python/pyspark/mllib/recommendation.py	28	__all__ = ["MatrixFactorizationModel", "ALS", "Rating"]	CODE
LOW	python/pyspark/mllib/fpm.py	27	__all__ = ["FPGrowth", "FPGrowthModel", "PrefixSpan", "PrefixSpanModel"]	CODE
LOW	python/pyspark/mllib/linalg/__init__.py	71	__all__ = [	CODE
LOW	python/pyspark/mllib/linalg/distributed.py	40	__all__ = [	CODE
LOW	python/pyspark/mllib/stat/__init__.py	27	__all__ = [	CODE
LOW	python/pyspark/mllib/stat/test.py	22	__all__ = ["ChiSqTestResult", "KolmogorovSmirnovTestResult"]	CODE
LOW	python/pyspark/mllib/stat/distribution.py	18	__all__ = ["MultivariateGaussian"]	CODE
LOW	python/pyspark/mllib/stat/_statistics.py	33	__all__ = ["MultivariateStatisticalSummary", "Statistics"]	CODE
LOW	python/pyspark/pipelines/__init__.py	27	__all__ = [	CODE
LOW	python/pyspark/streaming/dstream.py	50	__all__ = ["DStream"]	CODE
LOW	python/pyspark/streaming/kinesis.py	25	__all__ = ["KinesisUtils", "InitialPositionInStream", "MetricsLevel", "utf8_decoder"]	CODE
LOW	python/pyspark/streaming/__init__.py	22	__all__ = ["StreamingContext", "DStream", "StreamingListener"]	CODE
LOW	python/pyspark/streaming/context.py	31	__all__ = ["StreamingContext"]	CODE
LOW	python/pyspark/streaming/listener.py	20	__all__ = ["StreamingListener"]	CODE
LOW	python/pyspark/testing/__init__.py	21	__all__ = ["assertDataFrameEqual", "assertSchemaEqual", "main"]	CODE
LOW	python/pyspark/testing/utils.py	46	__all__ = ["assertDataFrameEqual", "assertSchemaEqual"]	CODE
LOW	python/pyspark/testing/connectutils.py	106	def _set_relation_in_plan(self, plan: pb2.Plan, relation: pb2.Relation) -> None:	CODE
LOW	python/pyspark/testing/connectutils.py	111	def _set_command_in_plan(self, plan: pb2.Plan, command: pb2.Command) -> None:	CODE
LOW	python/pyspark/ml/regression.py	85	__all__ = [	CODE
LOW	python/pyspark/ml/clustering.py	64	__all__ = [	CODE
LOW	python/pyspark/ml/classification.py	111	__all__ = [	CODE
LOW	python/pyspark/ml/evaluation.py	50	__all__ = [	CODE
LOW	python/pyspark/ml/__init__.py	49	__all__ = [	CODE
LOW	python/pyspark/ml/feature.py	77	__all__ = [	CODE
LOW	python/pyspark/ml/recommendation.py	41	__all__ = ["ALS", "ALSModel"]	CODE
LOW	python/pyspark/ml/tuning.py	70	__all__ = [	CODE
LOW	python/pyspark/ml/image.py	36	__all__ = ["ImageSchema"]	CODE
LOW	python/pyspark/ml/fpm.py	35	__all__ = ["FPGrowth", "FPGrowthModel", "PrefixSpan"]	CODE
LOW	python/pyspark/ml/linalg/__init__.py	58	__all__ = [	CODE
LOW	python/pyspark/ml/torch/distributor.py	685	def set_torch_config(context: "BarrierTaskContext") -> None:	STRING
LOW	python/pyspark/ml/torch/distributor.py	702	def set_gpus(context: "BarrierTaskContext") -> None:	STRING
LOW	python/pyspark/ml/torch/distributor.py	711	def set_gpus(context: "BarrierTaskContext") -> None:	STRING
LOW	python/pyspark/ml/param/__init__.py	41	__all__ = ["Param", "Params", "TypeConverters"]	CODE
LOW	python/pyspark/ml/connect/__init__.py	37	__all__ = [	CODE
LOW	python/pyspark/errors/__init__.py	57	__all__ = [	CODE
58 more matches not shown…

Redundant / Tautological Comments 64 hits · 93 pts

Severity	File	Line	Snippet	Context
LOW	python/run-tests.py	506	# Check if the python executable has coverage installed when 'COVERAGE_PROCESS_START'	COMMENT
LOW	python/pyspark/worker.py	1180	# Check if this is a continuation of the previous batch's partition	COMMENT
LOW	python/pyspark/worker.py	1289	# Check if any partition column changed from previous row	COMMENT
LOW	python/pyspark/shell.py	56	# Check if th eprogress bar needs to be disabled.	COMMENT
LOW⚡	python/pyspark/pipelines/cli.py	74	# Check if it's a simple file path (no wildcards at all)	COMMENT
LOW⚡	python/pyspark/pipelines/cli.py	78	# Check if it's a folder path ending with /**	COMMENT
LOW	python/pyspark/pandas/frame.py	12670	# Check if DataFrame has rows - if yes, raise error; if no, return empty Series	COMMENT
LOW	python/pyspark/pandas/frame.py	12805	# Check if DataFrame has rows - if yes, raise error; if no, return empty Series	COMMENT
LOW	python/pyspark/pandas/data_type_ops/categorical_ops.py	116	# Check if categoricals have the same dtype, same categories, and same ordered	COMMENT
LOW	python/pyspark/pandas/typedef/typehints.py	648	# Check if the name is Tuple.	COMMENT
LOW	python/pyspark/pandas/indexes/base.py	2068	# Check if the `self` and `other` have different index types.	COMMENT
LOW	python/pyspark/sql/metrics.py	188	# Add yourself to the list if you have to.	COMMENT
LOW	python/pyspark/sql/dataframe.py	409	>>> # Check if the DataFrames are equal	STRING
LOW	python/pyspark/sql/session.py	2279	# Check if the target path already exists	COMMENT
LOW	python/pyspark/sql/types.py	3146	>>> # Check if numeric values are within the allowed range.	STRING
LOW⚡	python/pyspark/sql/tests/test_utils.py	1731	# Check if the error message contains information about 2 mismatches only.	COMMENT
LOW	python/pyspark/sql/tests/arrow/test_arrow_map.py	329	# Set it to a small odd value to exercise batching logic for all test cases	COMMENT
LOW	python/pyspark/sql/tests/pandas/bench_pipelined_udf.py	93	# Output results as JSON to stdout	COMMENT
LOW	…s/pandas/streaming/test_pandas_transform_with_state.py	1436	# Set it to a very small number so that every row would be a separate pandas df	COMMENT
LOW	…s/pandas/streaming/test_pandas_transform_with_state.py	1463	# Set it to a very large number so that every row would be in the same pandas df	COMMENT
LOW	…s/pandas/streaming/test_pandas_transform_with_state.py	1529	# Set it to a very small number so that every row would be a separate pandas df	COMMENT
LOW⚡	…/pyspark/sql/tests/pandas/streaming/test_tws_tester.py	751	# Set watermark to 15000 - key1's timer should fire.	COMMENT
LOW⚡	…/pyspark/sql/tests/pandas/streaming/test_tws_tester.py	756	# Set watermark to 16000 - key2's timer should fire.	COMMENT
LOW	…/pyspark/sql/tests/pandas/streaming/test_tws_tester.py	790	# Set watermark to 6000.	COMMENT
LOW	…/pyspark/sql/tests/pandas/streaming/test_tws_tester.py	821	# Set watermark to 20 seconds.	COMMENT
LOW	…/pyspark/sql/tests/pandas/streaming/test_tws_tester.py	923	# Set watermark to 10000.	COMMENT
LOW	python/pyspark/sql/streaming/readwriter.py	1550	# Check if the data should be processed	STRING
LOW	python/pyspark/sql/worker/plan_data_source_read.py	154	# Check if the names are the same as the schema.	COMMENT
LOW	python/pyspark/sql/worker/create_data_source.py	81	# Check if the provider name matches the data source's name.	COMMENT
LOW	python/pyspark/sql/worker/write_into_data_source.py	97	# Check if the provider name matches the data source's name.	COMMENT
LOW	python/pyspark/sql/connect/session.py	1121	# Check if total size exceeds the limit	COMMENT
LOW	python/pyspark/sql/connect/session.py	1131	# Check if adding this chunk would exceed batch size	COMMENT
LOW	python/pyspark/sql/connect/client/artifact.py	199	# Check if it is a file from the scheme	COMMENT
LOW	python/pyspark/sql/pandas/serializers.py	1093	# Check if the entire column is null	COMMENT
LOW	python/pyspark/sql/pandas/serializers.py	1336	# Check if the entire column is null	COMMENT
LOW	python/pyspark/sql/pandas/conversion.py	889	# Check if any columns need to be fixed for Spark to infer properly	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	69	# Check if all arguments have type hints	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	79	# Check if the return has a type hint	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	228	# Check if all arguments have type hints	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	238	# Check if the return has a type hint	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	421	# Check if all arguments have type hints	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	431	# Check if the return has a type hint	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	514	# Check if all arguments have type hints	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	524	# Check if the return has a type hint	COMMENT
LOW	python/pyspark/sql/pandas/typehints.py	600	# Check if the name is Tuple first. After that, check the generic types.	COMMENT
LOW	sbin/spark-daemon.sh	50	# Check if --config is passed as an argument. It is an optional parameter.	COMMENT
LOW	sbin/spark-daemon.sh	154	# Check if the process has died; in that case we'll tail the log so the user can see	COMMENT
LOW	sbin/decommission-worker.sh	48	# Check if --block-until-exit is set.	COMMENT
LOW	sbin/workers.sh	57	# Check if --config is passed as an argument. It is an optional parameter.	COMMENT
LOW	…l/src/test/scala/org/apache/spark/repl/ReplSuite.scala	254	|# Set everything to be logged to the console	STRING
LOW	R/pkg/tests/fulltests/test_jvm_api.R	26	# Check if get returns the same element	COMMENT
LOW	R/pkg/R/sparkR.R	456	# Check if version number of SparkSession matches version number of SparkR package	COMMENT
LOW	R/pkg/R/serialize.R	45	# Check if all elements are of same type	COMMENT
LOW	R/pkg/R/jobj.R	31	# Check if jobj was created with the current SparkContext	COMMENT
LOW	R/pkg/R/DataFrame.R	386	# Check if the column names have . in it	COMMENT
LOW	R/pkg/R/DataFrame.R	2282	# Check if there is any duplicated column name in the DataFrame	COMMENT
LOW	R/pkg/inst/worker/worker.R	97	# Set libPaths to include SparkR package as loadNamespace needs this	COMMENT
LOW	.github/workflows/build_and_test.yml	1361	# Print the values of environment variables `SKIP_ERRORDOC`, `SKIP_SCALADOC`, `SKIP_PYTHONDOC`, `SKIP_RDOC` and	COMMENT
LOW	.github/workflows/build_and_test.yml	1385	# Print the values of environment variables `SKIP_ERRORDOC`, `SKIP_SCALADOC`, `SKIP_PYTHONDOC`, `SKIP_RDOC` and	COMMENT
LOW	.github/workflows/build_and_test.yml	1409	# Print the values of environment variables `SKIP_ERRORDOC`, `SKIP_SCALADOC`, `SKIP_PYTHONDOC`, `SKIP_RDOC` and	COMMENT
4 more matches not shown…

Fake / Example Data 99 hits · 83 pts

Severity	File	Line	Snippet	Context
LOW	…es/org/apache/spark/ui/static/jquery.dataTables.min.js	4	!function(n){"use strict";var a;"function"==typeof define&&define.amd?define(["jquery"],function(t){return n(t,window,do	CODE
LOW	python/pyspark/sql/tests/test_stat.py	316	dummy_value = 1	CODE
LOW	python/pyspark/sql/tests/test_stat.py	319	.replace({"Alice": "Bob"}, dummy_value)	CODE
LOW⚡	…hon/pyspark/sql/tests/pandas/test_pandas_udf_scalar.py	1110	.withColumn("name", lit("John Doe"))	CODE
LOW	python/pyspark/sql/pandas/functions.py	109	>>> df = spark.createDataFrame([("John Doe",)], ("name",))	STRING
LOW	python/pyspark/sql/pandas/functions.py	124	>>> df = spark.createDataFrame([("John Doe",)], ("name",))	STRING
LOW	python/pyspark/sql/pandas/functions.py	506	>>> df = spark.createDataFrame([("John Doe",)], ("name",))	STRING
LOW	python/pyspark/sql/pandas/functions.py	518	>>> df = spark.createDataFrame([("John Doe",)], ("name",))	STRING
LOW	…apache/spark/graphx/lib/ConnectedComponentsSuite.scala	119	val defaultUser = ("John Doe", "Missing")	CODE
LOW	docs/graphx-programming-guide.md	193	val defaultUser = ("John Doe", "Missing")	CODE
LOW	docs/graphx-programming-guide.md	432	val defaultUser = ("John Doe", "Missing")	CODE
LOW	examples/src/main/python/sql/arrow.py	308	df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))	CODE
LOW	…s/test-data/xml-resources/mixed_children_as_string.xml	4	Lorem ipsum dolor sit amet. Ut <i>voluptas</i> distinctio et impedit deserunt aut quam fugit et quaerat odit	CODE
LOW	…s/test-data/xml-resources/mixed_children_as_string.xml	4	Lorem ipsum dolor sit amet. Ut <i>voluptas</i> distinctio et impedit deserunt aut quam fugit et quaerat odit	CODE
LOW	…/test/resources/test-data/xml-resources/processing.xml	4	lorem ipsum	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	150	INSERT INTO products VALUES (1, 'Super Widget', 'Electronics', 155.99, 99.99, 1, 'Acme Inc', 'John D.', '123 Main St', 2	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	150	INSERT INTO products VALUES (1, 'Super Widget', 'Electronics', 155.99, 99.99, 1, 'Acme Inc', 'John D.', '123 Main St', 2	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	154	INSERT INTO customers VALUES (1, 'Alice Johnson', 'alice@example.com', '555-1000', '101 Maple Ave', NULL, 'Springfield',	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	155	INSERT INTO customers VALUES (2, 'Bob Smith', 'bob@example.com', '555-1002', '202 Oak St', 'Apt 3', 'Oakville', 'CA', '6	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	156	INSERT INTO customers VALUES (3, 'Cathy Lee', 'cathy@example.com', '555-1003', '303 Pine Ln', NULL, 'Pineville', 'TX', '	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	166	INSERT INTO employees VALUES (1, 'Dan Miller', 'dan@example.com', '555-2001', 'Manager', 'Sales', TIMESTAMP '2018-01-01'	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	167	INSERT INTO employees VALUES (2, 'Eva Perez', 'eva@example.com', '555-2002', 'Salesperson', 'Sales', TIMESTAMP '2019-03-	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	168	INSERT INTO employees VALUES (3, 'Frank Wong', 'frank@example.com', '555-2003', 'Warehouse', 'Operations', TIMESTAMP '20	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	170	INSERT INTO suppliers VALUES (1, 'Acme Inc', 'John D.', 'Sales Manager', 'john@acme.com', '555-3001', '555-3002', '123 M	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	170	INSERT INTO suppliers VALUES (1, 'Acme Inc', 'John D.', 'Sales Manager', 'john@acme.com', '555-3001', '555-3002', '123 M	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	170	INSERT INTO suppliers VALUES (1, 'Acme Inc', 'John D.', 'Sales Manager', 'john@acme.com', '555-3001', '555-3002', '123 M	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	171	INSERT INTO suppliers VALUES (2, 'Widgets Co', 'Mary K.', 'Customer Success', 'mary@widgets.com', '555-4001', NULL, '456	CODE
LOW⚡	…-tests/inputs/scripting/randomly_generated_scripts.sql	172	INSERT INTO suppliers VALUES (3, 'Toy Supply', 'Ann T.', 'Director', 'ann@toysupply.com', '555-5001', NULL, '789 Oak St'	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	212	INSERT INTO suppliers VALUES (v_temp_id, 'Temp Supplier', 'Temp Contact', 'Temp Role', 'temp	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	260	INSERT INTO customers VALUES (v_new_customer_id, 'New Customer', 'new@customer.com', '555-1111', '55	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	393	VALUES (sub_emp.employee_id + 9999, v_name_part, CONCAT(v_name_part, '@company.com')	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	405	VALUES (emp.employee_id + 10000, CONCAT('Emp_', emp.employee_id), emp.employee_name, 'Employ	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	474	INSERT INTO products VALUES ((SELECT COALESCE(MAX(product_id), 0) + 1 FROM products), 'Rare ' || v_m	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	519	'555-1212',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	552	'555-1111',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	553	'123 Main St',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	727	'555-7777',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	918	INSERT INTO employees VALUES (v_new_id, 'New Emp ' || v_new_id, 'new' || v_new_id || '@c	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	930	INSERT INTO employees VALUES (v_temp_id, 'Manager ' || v_temp_id, 'manager' || v_temp_id || '@compan	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	941	INSERT INTO employees VALUES (v_low_level_emp + 10, 'Temp Emp ' || v_low_level_emp, 'temp' || v_low_level_em	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	982	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	983	'123 Main St',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1097	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1169	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1417	'123 Main St',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1444	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1491	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1528	'555-0001',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1556	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	1763	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2072	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2275	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2582	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2615	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2788	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	2866	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	3112	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	3223	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	3253	'555-0000',	CODE
LOW	…-tests/inputs/scripting/randomly_generated_scripts.sql	3282	'555-0000',	CODE
39 more matches not shown…

Slop Phrases 14 hits · 22 pts

Severity	File	Line	Snippet	Context
LOW	…/main/java/org/apache/spark/SparkFirehoseListener.java	27	* This is a concrete Java class in order to ensure that we don't forget to update it when adding	COMMENT
MEDIUM	…/org/apache/spark/storage/BlockReplicationPolicy.scala	101	* Method to prioritize a bunch of candidate peers of a block. This is a basic implementation,	COMMENT
LOW	python/packaging/classic/setup.py	158	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/classic/setup.py	158	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/classic/setup.py	358	# Don't forget to update python/docs/source/getting_started/install.rst	COMMENT
LOW	python/packaging/connect/setup.py	87	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/connect/setup.py	87	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/connect/setup.py	117	# Don't forget to update python/docs/source/getting_started/install.rst	COMMENT
LOW	python/packaging/client/setup.py	134	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/client/setup.py	134	# Also don't forget to update python/docs/source/getting_started/install.rst,	COMMENT
LOW	python/packaging/client/setup.py	210	# Don't forget to update python/docs/source/getting_started/install.rst	COMMENT
LOW	python/pyspark/pandas/config.py	114	# NOTE: if you are fixing or adding an option here, make sure you execute `show_options()` and	COMMENT
LOW	dev/create-release/release-build.sh	774	# NOTE: Don't forget to update the valid combinations of distributions at	COMMENT
LOW	…/main/scala/org/apache/spark/sql/connect/Dataset.scala	146	// Make sure we don't forget to set plan id.	COMMENT

TODO Padding 11 hits · 16 pts

Severity	File	Line	Snippet	Context
LOW	python/pyspark/pandas/groupby.py	2192	# TODO: implement 'dropna' parameter	COMMENT
LOW	…he/spark/ml/classification/JavaGBTClassifierSuite.java	71	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…ml/classification/JavaDecisionTreeClassifierSuite.java	66	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…ml/classification/JavaRandomForestClassifierSuite.java	90	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…park/ml/regression/JavaDecisionTreeRegressorSuite.java	68	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…park/ml/regression/JavaRandomForestRegressorSuite.java	92	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…/apache/spark/ml/regression/JavaGBTRegressorSuite.java	72	// TODO: Add test once save/load are implemented. SPARK-6725	COMMENT
LOW	…/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala	66	// TODO: implement weighted bootstrapping	COMMENT
LOW	…rk/sql/execution/datasources/v2/FileDataSourceV2.scala	101	// TODO: implement a light-weight partition inference which only looks at the path of one leaf	COMMENT
LOW	…yst/expressions/aggregate/datasketchesAggregates.scala	152	// TODO: implement support for decimal/datetime/interval types	STRING
LOW	…che/spark/sql/hive/execution/InsertIntoHiveTable.scala	121	// TODO: implement hive compatibility as rules.	COMMENT

AI Response Leakage 2 hits · 15 pts

Severity	File	Line	Snippet	Context
HIGH	python/pyspark/sql/tests/connect/test_connect_basic.py	1644	# In this example, the max chunk size is set to a small value, so each Arrow	COMMENT
HIGH	…cala/org/apache/spark/sql/connector/catalog/txns.scala	187	// This is where the table pinning logic should occur. In this implementation, a tables is loaded	COMMENT

Docstring Block Structure 1 hit · 5 pts

Severity	File	Line	Snippet	Context
HIGH	python/pyspark/testing/sqlutils.py	115	Read the classpath file for a project and return it as a comma-separated string. The classpath file is typical	STRING

Modern AI Meta-Vocabulary 1 hit · 3 pts

Severity	File	Line	Snippet	Context
MEDIUM	…la/org/apache/spark/sql/FileBasedDataSourceSuite.scala	1472	// Embed the nanos leaf inside a struct, an array, and a map value. The guardrails	COMMENT

Synthetic Comment Markers 1 hit · 2 pts

Severity	File	Line	Snippet	Context
HIGH	python/pyspark/ml/dl_util.py	103	the empty string, nothing will be written after the auto-generated code.	STRING
