A comprehensive framework for implementing, experimenting with, and benchmarking sparse attention mechanisms in transformer models. This repository provides a unified interface for various sparse attention algorithms, seamless integration with HuggingFace Transformers, and extensive benchmarking capabilities across multiple long-context evaluation datasets.
```
sparse-attention-hub/
├── sparse_attention_hub/            # Core package
│   ├── adapters/                    # Model integration adapters
│   │   ├── huggingface.py           # HuggingFace Transformers integration
│   │   └── README.md                # Adapter documentation
│   ├── sparse_attention/            # Sparse attention implementations
│   │   ├── research_attention/      # Research-focused attention mechanisms
│   │   │   ├── maskers/             # Masker implementations
│   │   │   │   ├── fixed/           # Fixed pattern maskers
│   │   │   │   └── sampling/        # Sampling-based maskers
│   │   │   └── README.md            # Research attention documentation
│   │   └── efficient_attention/     # Production-optimized attention
│   └── metric_logging/              # Micro metric logging
├── benchmark/                       # Benchmarking suite
│   ├── raytune/                     # Ray Tune optimization framework
│   │   └── README.md                # Optimization documentation
│   ├── longbench/                   # LongBench evaluation
│   ├── infinite_bench/              # InfiniteBench evaluation
│   ├── ruler/                       # RULER evaluation
│   ├── zero_scrolls/                # Zero Scrolls evaluation
│   ├── loogle/                      # Loogle evaluation
│   ├── AIME2025/                    # AIME 2025 mathematical reasoning
│   └── executor.py                  # Main benchmark executor
├── tests/                           # Comprehensive test suite
├── tutorials/                       # Usage tutorials and examples
└── scripts/                         # Utility scripts
```
A Mask object represents attention patterns that control which tokens can attend to each other. The framework supports two main representations:
- Dense Representation: a full tensor of shape `(batch_size, num_heads, seq_len_queries, seq_len_keys)`
- Sparse Representation: a compressed format using index and pointer arrays for memory efficiency
Special masks:
- Empty Mask: All elements are 0.0 (no attention connections)
- Full Mask: All elements are 1.0 (dense attention, memory-optimized)
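To make the two representations concrete, here is a minimal sketch in plain PyTorch; it illustrates the idea, not the library's actual Mask API. The same pattern is stored densely and in a CSR-style index/pointer format:

```python
import torch

# Illustrative only: conceptual dense vs. sparse storage of one mask,
# not the library's actual Mask API.
batch_size, num_heads, seq_q, seq_k = 1, 1, 4, 4

# Dense representation: one weight per (query, key) pair.
dense = torch.zeros(batch_size, num_heads, seq_q, seq_k)
dense[..., :, 0] = 1.0    # every query attends to the first key (a "sink")
dense[..., 2, 1:3] = 1.0  # query 2 also attends to keys 1-2

# Sparse (CSR-style) representation of the same pattern for one head:
# indices[ptr[q]:ptr[q+1]] are the active key positions of query q.
indices = torch.tensor([0, 0, 0, 1, 2, 0])  # active key positions per query
ptr = torch.tensor([0, 1, 2, 5, 6])         # row offsets, one per query + 1

# Round-trip check: rebuild the dense mask from the sparse form.
rebuilt = torch.zeros(seq_q, seq_k)
for q in range(seq_q):
    rebuilt[q, indices[ptr[q]:ptr[q + 1]]] = 1.0
assert torch.equal(rebuilt, dense[0, 0])
```

The sparse form stores only active positions, so its memory cost scales with the number of attention connections rather than with seq_len_queries × seq_len_keys.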
A Masker is a component that applies specific masking logic to attention computation. Each masker implements the add_mask() method which:
- Takes attention tensors (queries, keys, values) and a previous mask
- Applies its specific masking logic, adding more active elements to the mask
- Returns a new mask that can be further processed by subsequent maskers
Key Concept: Maskers are additive - they add attention connections to the existing mask rather than replacing it entirely. This allows for composition of different attention patterns.
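The following is a rough sketch of this additive contract. The classes and constructors below are illustrative stand-ins, not the library's real maskers, and the masks are kept dense for clarity:

```python
import torch

# Illustrative stand-ins for the masker contract; the queries/keys/values
# arguments are ignored here because these patterns are position-only.
class SinkMasker:
    def __init__(self, sink_size: int):
        self.sink_size = sink_size

    def add_mask(self, queries, keys, values, previous_mask):
        new_mask = previous_mask.clone()
        new_mask[..., :, : self.sink_size] = 1.0  # all queries attend to the first tokens
        return new_mask

class LocalMasker:
    def __init__(self, window_size: int):
        self.window_size = window_size

    def add_mask(self, queries, keys, values, previous_mask):
        new_mask = previous_mask.clone()
        seq_q = new_mask.shape[-2]
        for q in range(seq_q):
            lo = max(0, q - self.window_size + 1)
            new_mask[..., q, lo : q + 1] = 1.0  # causal sliding window
        return new_mask

# Composition: each masker only ever adds connections to the running mask.
mask = torch.zeros(1, 1, 8, 8)  # start from the empty mask
for masker in [SinkMasker(sink_size=2), LocalMasker(window_size=3)]:
    mask = masker.add_mask(None, None, None, mask)
```

Because each step only ever sets elements to 1.0, later maskers cannot remove connections added earlier; Sink plus Local yields the union of both patterns.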
For detailed information about masks and maskers, see the Research Attention README.
The framework provides a flexible configuration system for creating sparse attention mechanisms. You can combine multiple maskers to create complex attention patterns:
```python
from sparse_attention_hub.sparse_attention.research_attention import ResearchAttentionConfig
from sparse_attention_hub.sparse_attention.research_attention.maskers.fixed.implementations import (
    SinkMaskerConfig,
    LocalMaskerConfig,
)

# Create a basic sparse attention configuration
config = ResearchAttentionConfig(
    masker_configs=[
        SinkMaskerConfig(sink_size=128),    # Keep first 128 tokens
        LocalMaskerConfig(window_size=256)  # Local attention window
    ]
)
```

The framework supports various state-of-the-art sparse attention mechanisms:
- HashAttention (Desai et al. 2024): Hash-based attention selection
- vAttention (Desai et al. 2025): Adaptive sampling mechanisms
- MagicPig (Chen et al. 2024): LSH-based similarity sampling
- Oracle-based methods: Research-only mechanisms using ground truth attention
For comprehensive examples and detailed masker implementations, see the Research Attention README.
The framework includes an optimization system using Ray Tune for hyperparameter search:
```bash
python3 benchmark/raytune/run_optimize_configs.py \
    --objective sparsity_10 \
    --optimal-configs-dir <base_dir> \
    --num-samples 1 \
    --search-max-new-tokens 5 \
    --search-max-context-length 32768 \
    --search-max-requests 2 \
    --actors-per-gpu 1
```

To benchmark a directory of optimized configurations:

```bash
python3 benchmark/raytune/run_config_dir.py \
    --configs-dir <base_dir/config_dir> \
    --max-new-tokens 100 \
    --max-context-length 32768 \
    --max-requests 2 \
    --actors-per-gpu 1 \
    --benchmark-results-dir ./benchmark_results/
```

The optimization system supports:
- Distributed Execution: Ray-based parallel processing across multiple GPUs
- Automatic Resource Management: Efficient GPU utilization and task scheduling
- Comprehensive Metrics: Detailed performance and accuracy measurements
- Search Space Definition: Customizable hyperparameter search spaces (see the sketch below)
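As a rough illustration, Ray Tune expresses search spaces as plain dictionaries of sampling primitives such as tune.choice. The parameter names below are hypothetical; the repository's actual search spaces are defined under benchmark/raytune/:

```python
from ray import tune

# Hypothetical search space over masker hyperparameters; the real search
# spaces live under benchmark/raytune/. tune.choice samples one value
# from the list for each trial.
search_space = {
    "sink_size": tune.choice([64, 128, 256]),
    "window_size": tune.choice([128, 256, 512, 1024]),
}
```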
For detailed optimization documentation, see the Ray Tune README.
The framework provides a comprehensive benchmarking system that can evaluate sparse attention configurations across multiple datasets:
```python
from benchmark.executor import BenchmarkExecutor
from benchmark.executor_config import BenchmarkConfig, AdapterConfig

# Define your models and configurations
models = ["meta-llama/Llama-3.2-1B-Instruct"]
sparse_configs = [
    ("dense", None),                 # Dense baseline
    ("sparse", your_sparse_config),  # Your sparse configuration
]

# Define benchmarks
benchmarks = [
    BenchmarkConfig(benchmark_name="longbench", subsets=["narrativeqa"]),
    BenchmarkConfig(benchmark_name="ruler", subsets=["4096"]),
    BenchmarkConfig(benchmark_name="infinite_bench", subsets=["passkey"]),
]

# Run benchmarks
executor = BenchmarkExecutor(
    gpu_ids=[0, 1, 2],
    max_concurrent_runs=3,
    base_result_dir="./results",
)
results = executor.run_benchmark_matrix(
    model_names=models,
    sparse_attention_configs=sparse_configs,
    benchmark_configs=benchmarks,
    adapter_config=AdapterConfig(),
)
```
Benchmarks can also be launched from the provided scripts:

```bash
# Run a minimal benchmark
python benchmark/scripts/benchmark.py

# Run full benchmarking suite
python benchmark/scripts/full_benchmarking/full_benchmark.py
```

The framework supports a comprehensive suite of long-context evaluation benchmarks:
| Benchmark | Description | Context Length | Tasks |
|---|---|---|---|
| LongBench | Long-context understanding | Up to 100K tokens | 6 task categories (narrative QA, summarization, etc.) |
| LongBench-v2 | Extended long-context evaluation | Up to 100K tokens | Enhanced version of LongBench |
| InfiniteBench | Infinite context evaluation | Up to 1M+ tokens | 12 major tasks including passkey retrieval |
| RULER | Synthetic long-context evaluation | 4K-128K tokens | 13 tasks in 4 categories (needle-in-haystack, QA, etc.) |
| Zero Scrolls | Multi-domain evaluation | Variable | 10 tasks across summarization, QA, sentiment |
| Loogle | Short and long dependency understanding | Variable | 7 major tasks |
| AIME 2025 | Mathematical reasoning | Variable | 30 competition problems |
- HuggingFace Integration: All benchmarks use processed HuggingFace datasets
- Automatic Evaluation: Built-in metrics calculation and result aggregation
- Resumability: Skip completed experiments and resume from interruptions
- Parallel Execution: Multi-GPU support with dynamic resource allocation
- Comprehensive Logging: Detailed performance and accuracy metrics
```python
import torch

from sparse_attention_hub.adapters import ModelAdapterHF, Request
from sparse_attention_hub.sparse_attention.research_attention import ResearchAttentionConfig
from sparse_attention_hub.sparse_attention.research_attention.maskers.fixed.implementations import (
    SinkMaskerConfig,
    LocalMaskerConfig,
)

# 1. Create sparse attention configuration
sparse_config = ResearchAttentionConfig(
    masker_configs=[
        SinkMaskerConfig(sink_size=128),
        LocalMaskerConfig(window_size=256),
    ]
)

# 2. Initialize adapter
adapter = ModelAdapterHF(
    model_name="meta-llama/Llama-3.2-1B",
    sparse_attention_config=sparse_config,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# 3. Process requests
request = Request(
    context="The capital of France is Paris. It is known for the Eiffel Tower.",
    questions="What is the capital of France?",
    answer_prefix="Answer: ",
)
response = adapter.process_request(
    request=request,
    generation_kwargs={"max_new_tokens": 50},
    request_kwargs={"max_context_length": 1024},
)
print(response.responses)  # "Answer: The capital of France is Paris."
```

```bash
# Clone the repository
git clone https://github.com/xAlg-ai/sparse-attention-hub.git
cd sparse-attention-hub
# Install the package
pip install -e .
# Install development dependencies
pip install -e ".[dev]"
```

```bash
# Run all tests
pytest
# Run specific test categories
pytest -m unit # Unit tests
pytest -m integration  # Integration tests
```
- Adapters Module - Model integration and HuggingFace support
- Research Attention - Sparse attention mechanisms and maskers
- Ray Tune Optimization - Hyperparameter optimization and search