Repository Analysis

Dao-AILab/flash-attention

Fast and memory-efficient exact attention

13.4 Low AI signal
Adjusted Score: 13.4
Raw Score: 13.4
Time Factor: 100%
Last Push: 2026-05-29
Stars: 23,989
Language: Python
Lines of Code: 136,694
Files: 421
Pattern Hits: 1051
Scan Date: 2026-05-31

Severity Breakdown

CRITICAL 10 · HIGH 69 · MEDIUM 159 · LOW 813
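
As a quick consistency check, the four severity counts should add up to the 1051 pattern hits reported in the summary. A minimal sketch in Python: the variable names are illustrative, and the density and share figures are derived here rather than reported by the scan.

```python
# Counts and totals copied from the report; nothing here is re-measured.
severity_counts = {"CRITICAL": 10, "HIGH": 69, "MEDIUM": 159, "LOW": 813}
pattern_hits = 1051
lines_of_code = 136_694

total = sum(severity_counts.values())
assert total == pattern_hits  # the breakdown accounts for every hit

# Derived figures (not part of the report).
hits_per_kloc = 1000 * pattern_hits / lines_of_code
low_share = severity_counts["LOW"] / total
print(f"{hits_per_kloc:.2f} hits per 1,000 lines; {low_share:.1%} of hits are LOW")
```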

Pattern Findings

1051 matches across 14 categories.

Decorative Section Separators: 152 hits · 516 pts
Severity | File | Line | Snippet
MEDIUM | tools/sass_diff.py | 24 | # ── Parsing ──────────────────────────────────────────────────────────────────
MEDIUM | tools/sass_diff.py | 92 | # ── Diffing ──────────────────────────────────────────────────────────────────
MEDIUM | tools/sass_diff.py | 111 | # ── Display ──────────────────────────────────────────────────────────────────
MEDIUM | tools/sass_diff.py | 219 | # ── Main ─────────────────────────────────────────────────────────────────────
MEDIUM | tools/ci/run_fa4_ci.py | 27 | # ── GPU helpers ───────────────────────────────────────────────────────────────
MEDIUM | tools/ci/run_fa4_ci.py | 60 | # ── Step plan ─────────────────────────────────────────────────────────────────
MEDIUM | tools/ci/run_fa4_ci.py | 103 | # ── Step runner ───────────────────────────────────────────────────────────────
MEDIUM | tools/ci/run_fa4_ci.py | 129 | # ── CLI ───────────────────────────────────────────────────────────────────────
MEDIUM | tests/cute/test_mask_mod_varlen.py | 494 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 496 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 56 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 58 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 95 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 97 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 244 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 246 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 421 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 423 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 619 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 623 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 755 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod_varlen.py | 757 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 70 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 72 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 178 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 180 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 401 | # =============================================================================
MEDIUM | tests/cute/test_score_mod_varlen.py | 403 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 7 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 10 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 485 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 487 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 197 | # =============================================================================
MEDIUM | tests/cute/score_mod_definitions.py | 201 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod.py | 838 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod.py | 842 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod.py | 1245 | # =============================================================================
MEDIUM | tests/cute/test_mask_mod.py | 1247 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 14 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 17 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 19 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 21 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 176 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 180 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 277 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 279 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 340 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 344 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 395 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 397 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 499 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 501 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 603 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 605 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 755 | # =============================================================================
MEDIUM | tests/cute/mask_mod_definitions.py | 757 | # =============================================================================
MEDIUM | tests/cute/test_flash_attn.py | 2725 | # ---------------------------------------------------------------------------
MEDIUM | tests/cute/test_flash_attn.py | 2727 | # ---------------------------------------------------------------------------
MEDIUM | tests/cute/test_flash_attn.py | 2787 | # ---------------------------------------------------------------------------
MEDIUM | tests/cute/test_flash_attn.py | 2789 | # ---------------------------------------------------------------------------
92 more matches not shown…
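
The report does not state the rule behind this category. As a rough illustration only, the sketch below flags comment lines that contain a long run of one repeated decorative character, which would match every snippet listed above. The function name, the character set, and the run length of 10 are assumptions, not the scanner's actual configuration.

```python
import re

# Hypothetical heuristic: a line that starts with a comment marker ('#' or '//')
# and contains the same decorative character at least 10 times in a row.
SEPARATOR_RE = re.compile(r"^\s*(#|//)\s*.*?([─═=\-*#])\2{9,}")

def is_decorative_separator(line: str) -> bool:
    """Return True if the line looks like a banner-style comment separator."""
    return SEPARATOR_RE.search(line) is not None

samples = [
    "# ── Parsing ──────────────────────────────",
    "# =========================================",
    "# Compute the mask for the current block",
]
for text in samples:
    print(is_decorative_separator(text), text)
```

A rule like this is cheap to run but cannot tell a hand-written banner from a generated one, which may be why the category is rated MEDIUM rather than higher.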
Hyper-Verbose Identifiers: 331 hits · 320 pts
Severity | File | Line | Snippet
LOW | setup.py | 92 | def get_cuda_bare_metal_version(cuda_dir):
LOW | setup.py | 201 | def validate_and_update_archs(archs):
LOW | csrc/layer_norm/setup.py | 16 | def get_cuda_bare_metal_version(cuda_dir):
LOW | csrc/layer_norm/setup.py | 25 | def check_cuda_torch_binary_vs_bare_metal(cuda_dir):
LOW | csrc/fused_dense_lib/setup.py | 10 | def get_cuda_bare_metal_version(cuda_dir):
LOW | hopper/test_attn_kvcache.py | 155 | def test_flash_attn_kvcache_nosplit(nheads_kv, gqa_ratio, num_requests, query_seqlen, context_seqlen, headdim, causal, g
LOW | hopper/test_attn_kvcache.py | 292 | def test_flash_attn_kvcache_output(nheads_kv, gqa_ratio, num_requests, query_seqlen, context_seqlen, headdim, causal, us
LOW | hopper/test_torch_compile_and_export.py | 61 | def test_compile_and_package_model():
LOW | hopper/test_flash_attn_triton_amd.py | 334 | def test_flash_attn_varlen_output(
LOW | hopper/test_flash_attn_triton_amd.py | 1042 | def test_flash_attn_race_condition(seqlen_q, seqlen_k, d, causal, dtype):
LOW | hopper/test_util.py | 9 | def generate_random_padding_mask(max_seqlen, batch_size, device, mode="random", zero_lengths=False):
LOW | hopper/setup.py | 325 | def get_cuda_bare_metal_version(cuda_dir):
LOW | hopper/test_flash_attn_bwd_determinism.py | 391 | def test_flash_attn_varlen_output(
LOW | hopper/test_flash_attn.py | 404 | def test_flash_attn_varlen_output(
LOW | hopper/test_flash_attn.py | 1133 | def test_flash_attn_race_condition(seqlen_q, seqlen_k, d, causal, dtype):
LOW | hopper/flash_attn_interface.py | 313 | def _flash_attn_backward_fake(
LOW | hopper/flash_attn_interface.py | 747 | def flash_attn_qkvpacked_func(
LOW | training/src/metrics/num_tokens.py | 39 | def _forward_reduce_state_update(self, *args: Any, **kwargs: Any) -> Any:
LOW | training/src/callbacks/speed_monitor.py | 35 | def on_validation_epoch_start(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
LOW | training/src/optim/param_grouping.py | 15 | def group_parameters_for_optimizer(model, optimizer_cfg, bias_weight_decay=False,
LOW | training/src/utils/gpu_affinity.py | 57 | def set_single_unique_affinity(gpu_id, nproc_per_node):
LOW | training/src/utils/gpu_affinity.py | 80 | def set_socket_unique_affinity(gpu_id, nproc_per_node, mode):
LOW | training/src/utils/checkpoint.py | 32 | def blockdiag_to_dense_mlp_bert(state_dict):
LOW | training/src/utils/checkpoint.py | 41 | def interpolate_pos_embedding(state_dict, out_seqlen, pos_embedding_name='model.pos_encoder.pe', interleave=False):
LOW | training/src/utils/ddp_zero1.py | 24 | def get_zero_optimizer_state_dict_local(optimizer, global_rank):
LOW | tests/test_flash_attn_triton_amd.py | 44 | def attn_bias_from_alibi_slopes(
LOW | tests/test_flash_attn_triton_amd.py | 73 | def generate_random_padding_mask(max_seqlen, batch_size, device, mode="random"):
LOW | tests/test_flash_attn_triton_amd.py | 397 | def attention_blocksparse_ref(qkv, blockmask, attn_mask, dropout_p, dropout_mask):
LOW | tests/test_flash_attn_triton_amd.py | 601 | def test_flash_attn_qkvpacked(seqlen, d, dropout_p, causal, local, alibi, deterministic, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 748 | def test_flash_attn_varlen_qkvpacked(
LOW | tests/test_flash_attn_triton_amd.py | 1191 | def test_flash_attn_varlen_output(
LOW | tests/test_flash_attn_triton_amd.py | 1619 | def test_flash_attn_varlen_causal(
LOW | tests/test_flash_attn_triton_amd.py | 2230 | def test_flash_attn_race_condition(seqlen_q, seqlen_k, d, dropout_p, causal, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 2279 | def test_flash_attn_bwd_overflow(seqlen, d, causal, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 2336 | def test_flash_attn_bwd_transpose(seqlen, d, causal, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 2389 | def test_flash_attn_bwd_varlen_overflow(d, causal, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 2448 | def test_flash_attn_deterministic(seqlen_q, seqlen_k, swap_sq_sk, d, causal, local, dtype):
LOW | tests/test_flash_attn_triton_amd.py | 2507 | def test_flash_attn_varlen_deterministic(seqlen_q, seqlen_k, swap_sq_sk, d, causal, local, dtype):
LOW | tests/test_rotary.py | 229 | def test_rotary_emb_varlen_func(inplace, interleaved, rotary_fraction, seqlen_offsets_type, dtype):
LOW | tests/test_flash_attn_ck.py | 72 | def get_bwd_unsupported_reason(d, deterministic):
LOW | tests/test_flash_attn_ck.py | 85 | def ck_randval_to_dropout_mask(randval, p):
LOW | tests/test_flash_attn_ck.py | 92 | def pad_rearrange_dropout_mask_hts_to_bhss(S_dmask, cu_seqlens_q, seqlen_q_rounded, seqlen_k_rounded):
LOW | tests/test_flash_attn_ck.py | 124 | def test_flash_attn_qkvpacked(seqlen, d, dropout_p, causal, local, alibi, deterministic, dtype):
LOW | tests/test_flash_attn_ck.py | 222 | def test_flash_attn_varlen_qkvpacked(seqlen, d, dropout_p, causal, local, alibi, deterministic, dtype):
LOW | tests/test_flash_attn_ck.py | 573 | def test_flash_attn_varlen_output(
LOW | tests/test_flash_attn_ck.py | 931 | def test_flash_attn_varlen_causal(
LOW | tests/test_flash_attn_ck.py | 1363 | def test_flash_attn_race_condition(seqlen_q, seqlen_k, d, dropout_p, causal, dtype):
LOW | tests/test_flash_attn_ck.py | 1410 | def test_flash_attn_bwd_overflow(seqlen, d, causal, dtype):
LOW | tests/test_flash_attn_ck.py | 1469 | def test_flash_attn_bwd_transpose(seqlen, d, causal, dtype):
LOW | tests/test_flash_attn_ck.py | 1522 | def test_flash_attn_bwd_varlen_overflow(d, causal, dtype):
LOW | tests/test_flash_attn_ck.py | 1558 | def test_flash_attn_bwd_varlen_seqq_zero(d, causal, nheads_kv, deterministic, dtype):
LOW | tests/test_flash_attn_ck.py | 1619 | def test_flash_attn_deterministic(seqlen_q, seqlen_k, swap_sq_sk, d, causal, local, dtype):
LOW | tests/test_flash_attn_ck.py | 1670 | def test_flash_attn_varlen_deterministic(seqlen_q, seqlen_k, swap_sq_sk, d, causal, local, dtype):
LOW | tests/test_util.py | 8 | def generate_random_padding_mask(max_seqlen, batch_size, device, mode="random", zero_lengths=False):
LOW | tests/test_flash_attn.py | 29 | def attn_bias_from_alibi_slopes(
LOW | tests/test_flash_attn.py | 58 | def generate_random_padding_mask(max_seqlen, batch_size, device, mode="random"):
LOW | tests/test_flash_attn.py | 382 | def attention_blocksparse_ref(qkv, blockmask, attn_mask, dropout_p, dropout_mask):
LOW | tests/test_flash_attn.py | 586 | def test_flash_attn_qkvpacked(seqlen, d, dropout_p, causal, local, alibi, deterministic, dtype):
LOW | tests/test_flash_attn.py | 733 | def test_flash_attn_varlen_qkvpacked(
LOW | tests/test_flash_attn.py | 1172 | def test_flash_attn_varlen_output(
271 more matches not shown…
Cross-File Repetition: 61 hits · 305 pts
Severity | File | Line | Snippet
HIGH | hopper/test_attn_kvcache.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, head_dim) k: (batch_size, seqlen_k, nheads_k, head_dim) v: (batch_size, seq
HIGH | tests/test_flash_attn_triton_amd.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, head_dim) k: (batch_size, seqlen_k, nheads_k, head_dim) v: (batch_size, seq
HIGH | tests/test_util.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, head_dim) k: (batch_size, seqlen_k, nheads_k, head_dim) v: (batch_size, seq
HIGH | tests/test_flash_attn.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, head_dim) k: (batch_size, seqlen_k, nheads_k, head_dim) v: (batch_size, seq
HIGH | hopper/test_kvcache.py | 0 | use pytorch benchmark on the forward pass of an arbitrary function.
HIGH | benchmarks/benchmark_gemm.py | 0 | use pytorch benchmark on the forward pass of an arbitrary function.
HIGH | flash_attn/cute/benchmark.py | 0 | use pytorch benchmark on the forward pass of an arbitrary function.
HIGH | flash_attn/utils/benchmark.py | 0 | use pytorch benchmark on the forward pass of an arbitrary function.
HIGH | hopper/benchmark_flash_attention_fp8.py | 0 | arguments: qkv: (batch_size, seqlen, 3, nheads, head_dim) dropout_p: float output: output: (batch_size, seqlen, nheads,
HIGH | benchmarks/benchmark_causal.py | 0 | arguments: qkv: (batch_size, seqlen, 3, nheads, head_dim) dropout_p: float output: output: (batch_size, seqlen, nheads,
HIGH | benchmarks/benchmark_flash_attention.py | 0 | arguments: qkv: (batch_size, seqlen, 3, nheads, head_dim) dropout_p: float output: output: (batch_size, seqlen, nheads,
HIGH | training/src/callbacks/wandb_callbacks.py | 0 | start executing this callback only after all validation sanity checks end.
HIGH | training/src/callbacks/wandb_callbacks.py | 0 | start executing this callback only after all validation sanity checks end.
HIGH | training/src/callbacks/wandb_callbacks.py | 0 | start executing this callback only after all validation sanity checks end.
HIGH | training/src/models/modules/seq_common.py | 0 | hidden_states: (b, s, d) if batch_first else (s, b, d)
HIGH | training/src/models/modules/seq_common.py | 0 | hidden_states: (b, s, d) if batch_first else (s, b, d)
HIGH | training/src/models/modules/seq_common.py | 0 | hidden_states: (b, s, d) if batch_first else (s, b, d)
HIGH | training/src/models/modules/seq_common.py | 0 | hidden_states: (b, s, d) if batch_first else (s, b, d)
HIGH | tests/test_flash_attn_triton_amd.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, d) k: (batch_size, seqlen_k, nheads_k, d) v: (batch_size, seqlen_k, nheads_
HIGH | tests/test_util.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, d) k: (batch_size, seqlen_k, nheads_k, d) v: (batch_size, seqlen_k, nheads_
HIGH | tests/test_flash_attn.py | 0 | arguments: q: (batch_size, seqlen_q, nheads, d) k: (batch_size, seqlen_k, nheads_k, d) v: (batch_size, seqlen_k, nheads_
HIGH | tests/test_flash_attn_triton_amd.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/test_flash_attn_ck.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/test_flash_attn.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/test_flash_attn_triton_amd.py | 0 | we previously had a bug where we were using the wrong strides of dout, which shows up when dout is not contiguous.
HIGH | tests/test_flash_attn_ck.py | 0 | we previously had a bug where we were using the wrong strides of dout, which shows up when dout is not contiguous.
HIGH | tests/test_flash_attn.py | 0 | we previously had a bug where we were using the wrong strides of dout, which shows up when dout is not contiguous.
HIGH | tests/test_flash_attn_triton_amd.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/test_flash_attn_ck.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/test_flash_attn.py | 0 | we previously had a bug where not masking elements beyond seqlen_k caused nan in dq, in the case where seqlen % 128 != 0
HIGH | tests/cute/test_score_mod_varlen.py | 0 | tests equality between original and vectorized versions of score mods
HIGH | tests/cute/test_score_mod.py | 0 | tests equality between original and vectorized versions of score mods
HIGH | tests/cute/test_score_mod.py | 0 | tests equality between original and vectorized versions of score mods
HIGH | tests/models/test_llama.py | 0 | check that our implementation matches the hf implementation: the scores in fp16 should be around the same as the hf scor
HIGH | tests/models/test_falcon.py | 0 | check that our implementation matches the hf implementation: the scores in fp16 should be around the same as the hf scor
HIGH | tests/models/test_baichuan.py | 0 | check that our implementation matches the hf implementation: the scores in fp16 should be around the same as the hf scor
HIGH | flash_attn/cute/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/cute/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/cute/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/utils/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/utils/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/utils/benchmark.py | 0 | use pytorch benchmark on the forward+backward pass of an arbitrary function.
HIGH | flash_attn/cute/flash_fwd.py | 0 | configures and launches the flash attention kernel. mq/mk/mv/mo has same data types(supports fp16 and bf16) and same lay
HIGH | flash_attn/cute/flash_fwd.py | 0 | configures and launches the flash attention kernel. mq/mk/mv/mo has same data types(supports fp16 and bf16) and same lay
HIGH | flash_attn/cute/flash_fwd_sm90.py | 0 | configures and launches the flash attention kernel. mq/mk/mv/mo has same data types(supports fp16 and bf16) and same lay
HIGH | flash_attn/ops/rms_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/rms_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/rms_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/layer_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/layer_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/layer_norm.py | 0 | residual_in_fp32 only has an effect if residual is none. otherwise residual dtype is residual.dtype.
HIGH | flash_attn/ops/layer_norm.py | 0 | assume that arguments are contiguous and aligned to 16 bytes
HIGH | flash_attn/ops/layer_norm.py | 0 | assume that arguments are contiguous and aligned to 16 bytes
HIGH | flash_attn/ops/layer_norm.py | 0 | assume that arguments are contiguous and aligned to 16 bytes
HIGH | flash_attn/modules/mha.py | 0 | implement the scaled dot product attention with softmax. arguments --------- softmax_scale: the temperature to use for t
HIGH | flash_attn/modules/mha.py | 0 | implement the scaled dot product attention with softmax. arguments --------- softmax_scale: the temperature to use for t
HIGH | flash_attn/modules/mha.py | 0 | implement the scaled dot product attention with softmax. arguments --------- softmax_scale: the temperature to use for t
HIGH | flash_attn/modules/mha.py | 0 | implement the scaled dot product attention with softmax. arguments --------- softmax_scale: the temperature to use for t
HIGH | flash_attn/modules/mha.py | 0 | kv: (batch_size, seqlen, 2, nheads, head_dim) or (batch_size, 1, 2, nheads, head_dim)
HIGH | flash_attn/modules/mha.py | 0 | kv: (batch_size, seqlen, 2, nheads, head_dim) or (batch_size, 1, 2, nheads, head_dim)
1 more match not shown…
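
The snippets in this category read like docstrings or comments that were lowercased and whitespace-collapsed before comparison. The sketch below shows one way such grouping could work. It is an assumption about the approach, not the scanner's code, and `find_repeated_docstrings` and its `min_count` parameter are hypothetical names.

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())

def find_repeated_docstrings(docstrings, min_count=3):
    """Group (path, text) pairs by normalized text.

    Returns {normalized_text: [paths]} for texts seen at least min_count times.
    """
    groups = defaultdict(list)
    for path, text in docstrings:
        groups[normalize(text)].append(path)
    return {text: paths for text, paths in groups.items() if len(paths) >= min_count}

docs = [
    ("a.py", "Use PyTorch  Benchmark"),
    ("b.py", "use pytorch benchmark"),
    ("c.py", "USE pytorch\nbenchmark"),
    ("d.py", "something else"),
]
print(find_repeated_docstrings(docs))
```

Exact-match grouping of this kind also fires on deliberately shared documentation, such as the same argument-shape docstring repeated across several test files in the table above.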
Unused Imports: 200 hits · 198 pts
Severity | File | Line | Snippet
LOW | setup.py | 25 |
LOW | csrc/layer_norm/setup.py | 2 |
LOW | csrc/layer_norm/setup.py | 3 |
LOW | csrc/layer_norm/setup.py | 8 |
LOW | csrc/layer_norm/setup.py | 9 |
LOW | csrc/fused_dense_lib/setup.py | 5 |
LOW | tools/ci/run_fa4_ci.py | 7 |
LOW | hopper/benchmark_mla_decode.py | 11 |
LOW | hopper/test_attn_kvcache.py | 4 |
LOW | hopper/test_attn_kvcache.py | 6 |
LOW | hopper/test_attn_kvcache.py | 8 |
LOW | hopper/test_kvcache.py | 9 |
LOW | hopper/benchmark_flash_attention_fp8.py | 3 |
LOW | hopper/benchmark_flash_attention_fp8.py | 7 |
LOW | hopper/benchmark_flash_attention_fp8.py | 10 |
LOW | hopper/benchmark_flash_attention_fp8.py | 12 |
LOW | hopper/benchmark_flash_attention_fp8.py | 12 |
LOW | hopper/benchmark_flash_attention_fp8.py | 13 |
LOW | hopper/benchmark_flash_attention_fp8.py | 13 |
LOW | hopper/benchmark_flash_attention_fp8.py | 15 |
LOW | hopper/benchmark_flash_attention_fp8.py | 16 |
LOW | hopper/test_flash_attn_triton_amd.py | 7 |
LOW | hopper/benchmark_attn.py | 1 |
LOW | hopper/benchmark_attn.py | 2 |
LOW | hopper/benchmark_attn.py | 7 |
LOW | hopper/benchmark_attn.py | 8 |
LOW | hopper/benchmark_attn.py | 21 |
LOW | hopper/benchmark_attn.py | 24 |
LOW | hopper/benchmark_attn.py | 24 |
LOW | hopper/benchmark_attn.py | 24 |
LOW | hopper/benchmark_attn.py | 24 |
LOW | hopper/benchmark_attn.py | 24 |
LOW | hopper/setup.py | 25 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 2 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 7 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 8 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 10 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 16 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 16 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 23 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 24 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 24 |
LOW | hopper/test_flash_attn.py | 7 |
LOW | hopper/flash_attn_interface.py | 3 |
LOW | hopper/flash_attn_interface.py | 7 |
LOW | hopper/benchmark_split_kv.py | 5 |
LOW | training/tests/datamodules/test_language_modeling_hf.py | 6 |
LOW | training/src/eval.py | 1 |
LOW | training/src/eval.py | 8 |
LOW | training/src/metrics/perplexity.py | 9 |
LOW | training/src/metrics/accuracy.py | 1 |
LOW | training/src/metrics/accuracy.py | 4 |
LOW | training/src/tasks/seq.py | 1 |
LOW | training/src/tasks/seq.py | 4 |
LOW | training/src/distributed/ddp_comm_hooks.py | 3 |
LOW | training/src/distributed/ddp_comm_hooks.py | 3 |
LOW | training/src/callbacks/flop_count.py | 2 |
LOW | training/src/callbacks/flop_count.py | 8 |
LOW | training/src/callbacks/causality_monitor.py | 2 |
LOW | training/src/callbacks/ema.py | 8 |
140 more matches not shown…
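
The report lists only file and line for these hits, with no snippet. To make the category concrete, here is a minimal sketch of an unused-import check built on Python's standard `ast` module. It is not the scanner's implementation, and the function name `unused_imports` is an illustrative choice. Because it reports one entry per imported name, a single `from x import a, b, c` statement can produce several hits on the same line, which would be consistent with the repeated line numbers above.

```python
import ast

def unused_imports(source: str):
    """Return sorted (line, name) pairs for imported names never referenced."""
    tree = ast.parse(source)
    imported = {}  # bound name -> line number of the import statement
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # 'import a.b' binds the top-level name 'a'.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports cannot be checked this way
                    imported[alias.asname or alias.name] = node.lineno
    # Every identifier read or written anywhere in the module.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted((line, name) for name, line in imported.items() if name not in used)

example = "import os\nimport sys\nfrom math import sqrt, pi\nprint(os.getcwd(), sqrt(4))\n"
print(unused_imports(example))
```

A check this simple misreports names that are re-exported through `__all__` or imported only for side effects, so hits of this kind are best treated as candidates for review.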
Hallucination Indicators: 10 hits · 130 pts
Severity | File | Line | Snippet
CRITICAL | hopper/test_flash_attn_triton_amd.py | 1140 | assert torch.ops.flash_attn_3.fwd.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn_triton_amd.py | 1153 | assert torch.ops.flash_attn_3.bwd.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn_triton_amd.py | 1161 | assert torch.ops.flash_attn_3.fwd_combine.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn_triton_amd.py | 1165 | assert torch.ops.flash_attn_3.get_scheduler_metadata.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn.py | 1230 | assert torch.ops.flash_attn_3.fwd.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn.py | 1243 | assert torch.ops.flash_attn_3.bwd.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn.py | 1251 | assert torch.ops.flash_attn_3.fwd_combine.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | hopper/test_flash_attn.py | 1255 | assert torch.ops.flash_attn_3.get_scheduler_metadata.default._schema.is_backward_compatible_with(parse_schema(
CRITICAL | tests/models/test_btlm.py | 214 | assert model.transformer.embeddings.word_embeddings.weight.mean().abs() < 1e-4
CRITICAL | tests/models/test_btlm.py | 216 | model.transformer.embeddings.word_embeddings.weight.std()
Deep Nesting: 119 hits · 111 pts
Severity | File | Line | Snippet
LOW | setup.py | 101 |
LOW | tools/sass_diff.py | 128 |
LOW | hopper/benchmark_flash_attention_fp8.py | 34 |
LOW | hopper/test_flash_attn_triton_amd.py | 628 |
LOW | hopper/benchmark_attn.py | 76 |
LOW | hopper/setup.py | 125 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 110 |
LOW | hopper/test_flash_attn_bwd_determinism.py | 391 |
LOW | hopper/test_flash_attn.py | 715 |
LOW | hopper/benchmark_split_kv.py | 35 |
LOW | training/src/train.py | 32 |
LOW | training/src/callbacks/norm_monitor.py | 33 |
LOW | training/src/optim/param_grouping.py | 15 |
LOW | training/src/utils/gpu_affinity.py | 80 |
LOW | training/src/utils/gpu_affinity.py | 127 |
LOW | training/src/utils/ema.py | 228 |
LOW | training/src/utils/distributed.py | 70 |
LOW | training/src/models/modules/seq_common.py | 15 |
LOW | tests/cute/benchmark_mask_mod.py | 154 |
LOW | tests/cute/benchmark_mask_mod.py | 448 |
LOW | tests/cute/test_mask_mod_varlen.py | 249 |
LOW | tests/cute/test_mask_mod_varlen.py | 903 |
LOW | tests/cute/test_flash_attn_race_condition.py | 391 |
LOW | tests/cute/test_score_mod_varlen.py | 602 |
LOW | tests/cute/test_score_mod_varlen.py | 950 |
LOW | tests/cute/test_mask_mod.py | 1905 |
LOW | tests/cute/test_mask_mod.py | 1951 |
LOW | tests/cute/test_flash_attn.py | 125 |
LOW | tests/cute/test_flash_attn.py | 563 |
LOW | tests/cute/test_flash_attn.py | 1070 |
LOW | tests/cute/test_flash_attn.py | 2375 |
LOW | tests/cute/test_block_sparsity.py | 43 |
LOW | tests/cute/test_block_sparsity.py | 484 |
LOW | benchmarks/tune_ex2_emu.py | 33 |
LOW | benchmarks/tune_ex2_emu.py | 225 |
LOW | benchmarks/benchmark_attn.py | 361 |
LOW | benchmarks/bench_sm90.py | 334 |
LOW | benchmarks/bench_sm90.py | 367 |
LOW | benchmarks/bench_sm90.py | 397 |
LOW | benchmarks/bench_sm90.py | 452 |
LOW | benchmarks/bench_sm90.py | 489 |
LOW | flash_attn/flash_attn_triton.py | 66 |
LOW | flash_attn/flash_attn_triton.py | 365 |
LOW | flash_attn/flash_attn_interface.py | 31 |
LOW | flash_attn/cute/sm90_config_search.py | 174 |
LOW | flash_attn/cute/sm90_config_search.py | 315 |
LOW | flash_attn/cute/mask.py | 44 |
LOW | flash_attn/cute/mask.py | 145 |
LOW | flash_attn/cute/mask.py | 463 |
LOW | flash_attn/cute/mask.py | 580 |
LOW | flash_attn/cute/mask.py | 742 |
LOW | flash_attn/cute/mask.py | 1406 |
LOW | flash_attn/cute/mask.py | 1548 |
LOW | flash_attn/cute/mask.py | 1623 |
LOW | flash_attn/cute/sm100_hd256_2cta_fmha_forward.py | 554 |
LOW | flash_attn/cute/sm100_hd256_2cta_fmha_forward.py | 1525 |
LOW | flash_attn/cute/ampere_helpers.py | 35 |
LOW | flash_attn/cute/flash_bwd.py | 481 |
LOW | flash_attn/cute/flash_bwd.py | 1165 |
LOW | flash_attn/cute/flash_bwd.py | 1194 |
59 more matches not shown…
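
The report gives no nesting threshold or measurement rule for this category. As an illustration, the sketch below measures the deepest chain of nested control-flow blocks inside each function using the standard `ast` module. The threshold of 4, the set of node types counted, and the function names are all assumptions.

```python
import ast

# Statement types that open a nested block (an assumed, non-exhaustive set).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)

def max_nesting(node, depth=0):
    """Return the deepest count of nested block statements below `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest

def deeply_nested_functions(source, threshold=4):
    """Return (line, name, depth) for functions nested at least `threshold` deep."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            depth = max_nesting(node)
            if depth >= threshold:
                hits.append((node.lineno, node.name, depth))
    return sorted(hits)

example = '''
def shallow(x):
    if x:
        return 1
    return 0

def deep(rows):
    for row in rows:
        if row:
            for item in row:
                if item:
                    print(item)
'''
print(deeply_nested_functions(example))
```

Note that Python parses `elif` as an `If` nested inside the previous `orelse` branch, so a long `elif` chain counts as deep under this measure even though it reads as flat.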
Over-Commented Block: 95 hits · 93 pts
Severity | File | Line | Snippet
LOW | csrc/flash_attn_ck/flash_common.hpp | 1 | /******************************************************************************
LOW | csrc/layer_norm/static_switch.h | 1 | // Inspired by https://github.com/NVIDIA/DALI/blob/main/include/dali/core/static_switch.h
LOW | csrc/fused_dense_lib/fused_dense.cpp | 1 | // Adapted from https://github.com/NVIDIA/apex/blob/master/csrc/fused_dense.cpp
LOW | csrc/flash_attn/flash_api.cpp | 1 | /******************************************************************************
LOW | csrc/flash_attn/src/flash_fwd_kernel.h | 1 | /******************************************************************************
LOW | csrc/flash_attn/src/flash_fwd_kernel.h | 221 | // for (int i = 0; i < size(tScQ); ++i) {
LOW | csrc/flash_attn/src/utils.h | 1 | /******************************************************************************
LOW | csrc/flash_attn/src/utils.h | 321 | cute::clear(D(_, m, _));
LOW | csrc/flash_attn/src/utils.h | 341 | // if (Is_even_MN || get<0>(identity_MN(0, m, 0)) < max_MN) {
LOW | csrc/flash_attn/src/flash_fwd_launch_template.h | 1 | /******************************************************************************
LOW | csrc/flash_attn/src/flash_bwd_launch_template.h | 1 | /******************************************************************************
LOW | csrc/flash_attn/src/flash_bwd_launch_template.h | 21 | #define KERNEL_PARAM_MODIFIER __grid_constant__
LOW | csrc/flash_attn/src/flash_bwd_launch_template.h | 181 | // run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 128, 128, 8, 4, 4, 4, true, false, T>, Is_dropout>(params,
LOW | csrc/flash_attn/src/flash_bwd_launch_template.h | 241 | // printf("max_smem_per_block = %d\n", max_smem_per_block);
LOW | csrc/flash_attn/src/dropout.h | 41 | uint2 rowcol = make_uint2(block_row_start, block_col_start);
LOW | csrc/flash_attn/src/flash_bwd_kernel.h | 1 | /***************************************************************************************************
LOW | csrc/flash_attn/src/flash_bwd_kernel.h | 321 | // If not local, we're guaranteed that m_block_min <= m_block:
LOW | csrc/flash_attn/src/static_switch.h | 1 | // Inspired by
LOW | tools/ci/build_sif.sh | 1 | #!/usr/bin/env bash
LOW | hopper/mainloop_fwd_sm90_tma_gmma_ws.hpp | 1 | /******************************************************************************
LOW | hopper/utils.h | 1 | /******************************************************************************
LOW | hopper/mainloop_bwd_sm90_tma_gmma_ws.hpp | 1 | /******************************************************************************
LOW | hopper/benchmark_mla_decode.py | 121 | print(f"Arithmetic intensity: {flops / mem_io:.1f}")
LOW | hopper/mainloop_bwd_sm80.hpp | 1 | /******************************************************************************
LOW | hopper/mainloop_bwd_sm80.hpp | 561 | #pragma unroll
LOW | hopper/mainloop_bwd_sm80.hpp | 621 | // Instead of passing in tQcQ, we pass in t0QcQ and subtract the offset from the limit
LOW | hopper/flash_fwd_launch_template.h | 1 | /******************************************************************************
LOW | hopper/benchmark_flash_attention_fp8.py | 321 | # )()
LOW | hopper/test_flash_attn_triton_amd.py | 61 | @pytest.mark.parametrize("deterministic", [False])
LOW | hopper/test_flash_attn_triton_amd.py | 221 | and dtype != torch.float8_e4m3fn
LOW | hopper/test_flash_attn_triton_amd.py | 241 | # causal,
LOW | hopper/test_flash_attn_triton_amd.py | 501 | ):
LOW | hopper/test_flash_attn_triton_amd.py | 581 | # @pytest.mark.parametrize("mha_type", ["mha"])
LOW | hopper/test_flash_attn_triton_amd.py | 901 | out = output_pad_fn(out)
LOW | hopper/tile_scheduler.hpp | 641 | // Total number of blocks for the next 31 batches
LOW | hopper/tile_scheduler.hpp | 741 | int split_idx = bidh - bidh_actual * num_splits;
LOW | hopper/benchmark_attn.py | 41 | def time_fwd(func, *args, repeats=30, verbose=True, desc="", **kwargs):
LOW | hopper/benchmark_attn.py | 241 | # bs_seqlen_vals = [(32, 512), (16, 1024)]
LOW | hopper/benchmark_attn.py | 401 | # print(time_f)
LOW | hopper/flash_bwd_launch_template.h | 1 | /******************************************************************************
LOW | hopper/test_flash_attn_bwd_determinism.py | 61 | # @pytest.mark.parametrize("mha_type", ["mqa"])
LOW | hopper/test_flash_attn_bwd_determinism.py | 341 | # @pytest.mark.parametrize("dtype", [torch.float8_e4m3fn])
LOW | hopper/test_flash_attn_bwd_determinism.py | 401 | # batch_size = 40
LOW | hopper/static_switch.h | 1 | // Inspired by
LOW | hopper/mainloop_fwd_sm80.hpp | 1 | /******************************************************************************
LOW | hopper/test_flash_attn.py | 121 | # @pytest.mark.parametrize("has_qv", [True])
LOW | hopper/test_flash_attn.py | 301 | # k,
LOW | hopper/test_flash_attn.py | 361 | @pytest.mark.parametrize("softcap", [0.0] + ([15.0] if not DISABLE_SOFTCAP else []))
LOW | hopper/test_flash_attn.py | 601 | # None,
LOW | hopper/test_flash_attn.py | 621 | dv.masked_fill_(k_zero_masking, 0.0)
LOW | hopper/flash_fwd_kernel_sm90.h | 1 | /******************************************************************************
LOW | hopper/flash_api.cpp | 1221 | #ifndef FLASHATTENTION_DISABLE_HDIM256
LOW | hopper/benchmark_split_kv.py | 121 | causal=causal,
LOW | hopper/flash_api_stable.cpp | 1 | /******************************************************************************
LOW | hopper/flash_api_stable.cpp | 541 | #endif
LOW | hopper/flash_api_stable.cpp | 1241 | if (out_type == torch::headeronly::ScalarType::BFloat16) {
LOW | hopper/flash_api_stable.cpp | 1281 | if (params.d_rounded == 64) { return run_mha_bwd_<Arch, cutlass::half_t, 64, Has_softcap>(params, stream); }
LOW | hopper/flash_api_stable.cpp | 1301 | #endif
LOW | training/configs/experiment/owt/gpt2xl-flash.yaml | 1 | # @package _global_
LOW | training/configs/experiment/owt/gpt2l-flash.yaml | 1 | # @package _global_
35 more matches not shown…
Verbosity Indicators: 32 hits · 54 pts
Severity | File | Line | Snippet
LOW | hopper/flash_fwd_combine_kernel.h | 229 | // Step 1: load LSE_partial from gmem -> smem
LOW | hopper/flash_fwd_combine_kernel.h | 274 | // Step 2: Load O_partial from gmem -> smem for split = 0, 1, ..., kStages - 2.
LOW | hopper/flash_fwd_combine_kernel.h | 335 | // Step 3: load and transpose LSE_partial from smem -> rmem
LOW | hopper/flash_fwd_combine_kernel.h | 345 | // Step 4: compute the final LSE along the split dimension
LOW | hopper/flash_fwd_combine_kernel.h | 394 | // Step 5: store final LSE back to gmem
LOW | hopper/flash_fwd_combine_kernel.h | 417 | // Step 6: read O_partial from gmem -> smem -> rmem and accumulate the final O
LOW | hopper/flash_fwd_combine_kernel.h | 460 | // Step 7: Write the final O to gmem
LOW | hopper/flash_bwd_postprocess_kernel.h | 174 | // Step 1: load dQaccum from gmem to smem
LOW | hopper/flash_bwd_postprocess_kernel.h | 200 | // Step 2: Load dQaccum from smem to register, then convert fp32 -> fp16/bf16
LOW | hopper/flash_bwd_postprocess_kernel.h | 218 | // Step 3: Copy dQ from register to smem
LOW | hopper/flash_bwd_postprocess_kernel.h | 229 | // Step 4: Copy dQ from smem to register to prepare for coalesced write to gmem
LOW | hopper/flash_bwd_postprocess_kernel.h | 247 | // Step 5: Copy dQ from register to gmem
LOW | hopper/epilogue_fwd.hpp | 251 | // Step 1: Write O from rmem -> smem
LOW | hopper/epilogue_fwd.hpp | 281 | // Step 2: Write LSE from rmem -> gmem
LOW | hopper/epilogue_fwd.hpp | 310 | // Step 3: Write O from smem -> gmem
LOW | flash_attn/cute/flash_bwd.py | 251 | # Do we need to check if we overshot kBlockM when we load Q?
LOW | flash_attn/cute/flash_bwd.py | 253 | # Do we need to check if we overshot kBlockN when we load K?
LOW | flash_attn/cute/flash_bwd.py | 260 | # Do we need to check if we overshot kBlockN when we load V?
LOW | flash_attn/cute/flash_bwd_postprocess.py | 492 | # Step 1: load dQaccum from gmem to smem
LOW | flash_attn/cute/flash_bwd_postprocess.py | 501 | # Step 2: load dQ from smem to rmem
LOW | flash_attn/cute/flash_bwd_postprocess.py | 534 | # Step 3: Copy dQ from register to smem
LOW | flash_attn/cute/flash_bwd_postprocess.py | 568 | # Step 4: Copy dQ from smem to register to prepare for coalesced write to gmem
LOW | flash_attn/cute/flash_bwd_postprocess.py | 577 | # Step 5: Copy dQ from register to gmem
LOW | flash_attn/cute/flash_fwd.py | 495 | # Do we need to check if we overshoot kBlockN when we load K?
LOW | flash_attn/cute/flash_fwd.py | 541 | # Do we need to check if we overshoot kBlockN when we load V?
LOW | flash_attn/cute/flash_fwd_combine.py | 405 | # Step 1: Load LSE_partial from gmem to shared memory
LOW | flash_attn/cute/flash_fwd_combine.py | 442 | # Step 2: Load O_partial for pipeline stages
LOW | flash_attn/cute/flash_fwd_combine.py | 495 | # Step 3: Load and transpose LSE from smem to registers
LOW | flash_attn/cute/flash_fwd_combine.py | 513 | # Step 4: Compute final LSE along split dimension
LOW | flash_attn/cute/flash_fwd_combine.py | 573 | # Step 5: Store final LSE to gmem
LOW | flash_attn/cute/flash_fwd_combine.py | 595 | # Step 6: Read O_partial and accumulate final O
LOW | flash_attn/cute/flash_fwd_combine.py | 642 | # Step 7: Write final O to gmem
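Most of the Verbosity Indicators rows above share one shape: a `//` or `#` comment that opens with "Step N:". The report does not show the scanner's actual rule, so the following is only a minimal sketch of how such a comment could be matched. The names `STEP_COMMENT` and `step_number` are hypothetical, and the "Do we need to check ..." rows would need a separate rule that is not covered here.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the scanner's real rule):
# a C++ (//) or Python (#) comment line that starts with "Step <number>:".
STEP_COMMENT = re.compile(r"^\s*(?://|#)\s*Step\s+(\d+):", re.IGNORECASE)


def step_number(line: str) -> Optional[int]:
    """Return the step number if the line is a 'Step N:' comment, else None."""
    match = STEP_COMMENT.match(line)
    return int(match.group(1)) if match else None


# Sample lines taken from the snippets listed in the table above.
samples = [
    "    // Step 1: load LSE_partial from gmem -> smem",
    "# Step 7: Write final O to gmem",
    "# Do we need to check if we overshot kBlockM when we load Q?",
]
results = [step_number(s) for s in samples]
print(results)
```

Anchoring the pattern at the start of the line means a trailing comment after code is not matched, which keeps the sketch conservative.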
Cross-Language Confusion: 8 hits · 38 pts
Severity | File | Line | Snippet
HIGH | hopper/setup.py | 295 | blocks.append(cuda_compile_rule) # type: ignore[possibly-undefined]
HIGH | hopper/setup.py | 296 | blocks.append(cuda_compile_rule_sm80) # type: ignore[possibly-undefined]
HIGH | hopper/setup.py | 297 | blocks.append(cuda_compile_rule_sm80_sm90) # type: ignore[possibly-undefined]
HIGH | hopper/setup.py | 298 | blocks.append(cuda_compile_rule_sm100) # type: ignore[possibly-undefined]
HIGH | tests/cute/test_flash_attn_combine.py | 183 | # Only compare valid positions (beyond seqused, output is undefined)
HIGH | AI/parse_clc_log.py | 246 | let selectedSm = null;
HIGH | AI/parse_clc_log.py | 309 | if (id === query || id.includes(query)) {{
HIGH | AI/parse_clc_log.py | 326 | selectedSm = null;
AI Slop Vocabulary: 11 hits · 19 pts
Severity | File | Line | Snippet
LOW | hopper/generate_kernels.py | 134 | # so we should just pass in packgqa=False to avoid the `_packgqa` in the filename.
LOW | tests/cute/score_mod_definitions.py | 478 | # Don't read from aux_tensors at all - just add the global index as bias
MEDIUM | tests/cute/test_mask_mod.py | 6 | # (identity, document, block_diagonal, etc.) with comprehensive seqlen coverage
LOW | flash_attn/flash_attn_triton.py | 145 | # [2022-10-30] TD: Triton bug - in the case of EVEN_M=True and EVEN_N=False, if we just call
LOW | flash_attn/flash_attn_triton.py | 347 | # if we just call tl.store(dv_ptrs), there's a race condition
LOW | flash_attn/flash_attn_triton.py | 442 | # if we just call tl.load(k_ptrs), we get the wrong output!
LOW | flash_attn/flash_attn_triton.py | 521 | # [2022-11-01] TD: Triton bug, there's a race condition if we just use m_mask and not d_mask.
LOW | flash_attn/cute/flash_bwd_postprocess.py | 187 | # We can't just use kHeadDim here. E.g. if MMA shape is 64 x 96 but split across 2 WGs,
MEDIUM | flash_attn/cute/flash_fwd_sm100.py | 701 | # CLC buffers placed here to utilize padding before sO's 1024-byte alignment.
LOW | flash_attn/cute/flash_fwd_sm90.py | 1390 | # 2 elements. So we just call ptx directly.
LOW | flash_attn/cute/flash_fwd_sm90.py | 1462 | # 2 elements. So we just call ptx directly.
Excessive Try-Catch Wrapping: 17 hits · 18 pts
Severity | File | Line | Snippet
LOW | setup.py | 186 | except Exception as e:
MEDIUM | setup.py | 180 | def detect_hipify_v2():
LOW | tests/cute/benchmark_block_sparsity.py | 83 | except Exception as e:
LOW | tests/cute/benchmark_block_sparsity.py | 190 | except Exception as e:
LOW | tests/cute/benchmark_block_sparsity.py | 375 | except Exception as e:
LOW | benchmarks/tune_ex2_emu.py | 307 | except Exception as e:
LOW | benchmarks/tune_ex2_emu.py | 370 | except Exception as e:
LOW | benchmarks/bench_sm90.py | 126 | except Exception as e:
LOW | benchmarks/bench_sm90.py | 165 | except Exception as e:
LOW | benchmarks/bench_sm90.py | 175 | except Exception as e:
LOW | benchmarks/benchmark_flash_attention.py | 119 | except Exception:
LOW | benchmarks/benchmark_flash_attention.py | 134 | except Exception:
LOW | benchmarks/benchmark_flash_attention.py | 141 | except Exception:
LOW | flash_attn/cute/cute_dsl_ptxas.py | 93 | except Exception as e:
LOW | flash_attn/cute/benchmark_flash_attention_fp8.py | 330 | except Exception as e:
LOW | flash_attn/cute/benchmark_flash_attention_fp8.py | 402 | except Exception as e:
LOW | flash_attn/cute/utils.py | 77 | except Exception:
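Nearly every row in the Excessive Try-Catch Wrapping table is a bare `except Exception` clause. As a generic illustration of why a scanner might flag that shape, here is a small sketch contrasting a broad handler with a narrower one. The functions `load_config_broad` and `load_config_narrow` are hypothetical and not taken from the repository. The report does not show the surrounding code, so whether any flagged line should actually be narrowed cannot be judged from it.

```python
import json


def load_config_broad(text):
    # The flagged shape: every failure, including caller mistakes such as
    # passing None instead of a string, is silently turned into "no config".
    try:
        return json.loads(text)
    except Exception:
        return None


def load_config_narrow(text):
    # Narrower alternative: only malformed JSON is treated as "no config";
    # other errors (for example a TypeError from a wrong argument type) propagate.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(load_config_broad("{not json"))   # malformed input is swallowed
print(load_config_narrow("{not json"))  # malformed input is swallowed here too
print(load_config_narrow('{"a": 1}'))   # valid input parses normally
```

With the narrow handler, calling `load_config_narrow(None)` raises `TypeError` instead of returning `None`, which surfaces the caller's mistake rather than hiding it.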
Self-Referential Comments: 4 hits · 12 pts
Severity | File | Line | Snippet
MEDIUM | hopper/generate_kernels.py | 3 | # This file is run to generate the kernel instantiations for the flash_attn kernels
MEDIUM | tests/cute/test_flash_attn_combine.py | 254 | # Create a permuted batch index mapping: virtual batch -> real batch
MEDIUM | tests/models/test_llama.py | 578 | # Create a shared test model.
MEDIUM | flash_attn/modules/embedding.py | 137 | # Create a mask of valid vocab ids (1 means it needs to be masked).
Redundant / Tautological Comments: 8 hits · 11 pts
Severity | File | Line | Snippet
LOW | setup.py | 463 | # Check if torch is using hipify v2. Until CK is updated with HIPIFY_V2 macro,
LOW | hopper/setup.py | 384 | # Set timeout to 300 seconds to prevent the request from hanging forever.
LOW | tests/cute/test_utils.py | 204 | # Set __cute_hash__ to simulate Inductor-generated code
LOW | tests/cute/test_block_sparsity.py | 111 | # Check if ref skipped it entirely (all masked)
LOW | flash_attn/cute/flash_bwd.py | 133 | # Check if block size setting is out of shared memory capacity
LOW | flash_attn/cute/compute_block_sparsity.py | 374 | # Check if mask_mod is marked as suitable for 5-point sampling
LOW | flash_attn/cute/flash_fwd.py | 158 | # Check if block size setting is out of shared memory capacity
LOW | flash_attn/cute/flash_fwd.py | 171 | # Check if twice the block size is divisible by the number of threads
Example Usage Blocks: 3 hits · 4 pts
Severity | File | Line | Snippet
LOW | tools/ci/build_sif.sh | 4 | # Usage:
LOW | tests/cute/test_mask_mod_varlen.py | 8 | # Usage:
LOW | tests/cute/test_mask_mod.py | 10 | # Usage: