Memray: Full-Stack Memory Profiling for Python and Native Extensions
Overview
Memray is Bloomberg's open-source memory profiler that tracks every allocation across the entire call stack — Python code, C/C++ native extensions, and the CPython interpreter itself. Unlike sampling profilers that approximate hotspot detection, Memray hooks into every malloc, realloc, and free call, producing a complete, deterministic record of memory behavior. For teams building data pipelines, ML inference services, or performance-sensitive Python applications, it bridges a critical observability gap: what is using memory and exactly which code path caused it.
Key Features
- Full call-stack tracing: captures every allocation's complete trace, not a statistical sample, giving deterministic accuracy
- Native C/C++/Rust extension awareness: with --native, it visualizes allocations inside libraries like NumPy, pandas, or custom CPython modules alongside Python frames, rendered in distinct colors in flame graphs
- Live TUI mode: attach to a running process and inspect allocation patterns in real time with keyboard-navigable, sortable views per thread
- Multi-reporter output: flame graphs (HTML), tree views, summary tables, stats, and a pytest plugin (pytest-memray) that can fail tests exceeding configurable memory limits
- Process attachment: inject the tracker into an already-running Python process (PID-based) without restarting the application, using GDB/LLDB under the hood
Technical Architecture
Memray works by intercepting allocator calls inside the profiled process, not at the kernel level. Rather than relying on LD_PRELOAD (Linux) or DYLD_INSERT_LIBRARIES (macOS), it patches the symbol tables of the shared objects already loaded in the process, so that calls to malloc, calloc, realloc, posix_memalign, free, and related functions such as mmap and munmap are routed through its own wrappers. Each intercepted call records the allocation size and address together with the current Python stack. With --native it also unwinds the native stack, using libunwind on Linux or a frame-pointer walk on macOS.
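The interception idea can be illustrated with a toy model in pure Python. This is only a sketch of the wrap-and-record pattern, not Memray's implementation (which is native code hooking the real allocator); ToyHeap, AllocationRecord, install_hooks, and load_rows are hypothetical names invented for the illustration.

```python
import traceback
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AllocationRecord:
    kind: str                # "malloc" or "free"
    address: int
    size: int
    stack: Tuple[str, ...]   # function names, outermost caller first


class ToyHeap:
    """Stand-in for a real allocator: hands out increasing fake addresses."""

    def __init__(self) -> None:
        self._next = 0x1000
        self.live: Dict[int, int] = {}

    def malloc(self, size: int) -> int:
        addr = self._next
        self._next += size
        self.live[addr] = size
        return addr

    def free(self, addr: int) -> None:
        self.live.pop(addr, None)


def install_hooks(heap: ToyHeap, log: List[AllocationRecord]) -> None:
    """Replace the heap's entry points with wrappers that record each call."""
    real_malloc, real_free = heap.malloc, heap.free

    def hooked_malloc(size: int) -> int:
        addr = real_malloc(size)
        # Drop the last frame (this wrapper) so the stack ends at the caller.
        stack = tuple(f.name for f in traceback.extract_stack()[:-1])
        log.append(AllocationRecord("malloc", addr, size, stack))
        return addr

    def hooked_free(addr: int) -> None:
        log.append(AllocationRecord("free", addr, 0, ()))
        real_free(addr)

    heap.malloc = hooked_malloc
    heap.free = hooked_free


def load_rows(heap: ToyHeap) -> int:
    """Hypothetical application code that allocates a buffer."""
    return heap.malloc(256)


heap = ToyHeap()
log: List[AllocationRecord] = []
install_hooks(heap, log)
buf = load_rows(heap)
heap.free(buf)
```

After this runs, log holds one malloc record attributed to load_rows and one free record. The application code never changes; only the allocator's entry points are swapped, which is the property that lets a tracer see every allocation.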
In native mode, the native stack frames are stored as raw instruction pointers; symbolification is deliberately deferred to report-generation time rather than done at collection time. This keeps profiling overhead down (the figures quoted here are roughly 10-30% slowdown for Python-only tracing and more with --native, though the actual cost depends on how allocation-heavy the workload is). At analysis time, Memray converts instruction pointers to human-readable symbols using DWARF debug info when available, falling back to ELF symbol tables. It also integrates with debuginfod to fetch debug symbols from remote servers, which is essential when distribution packages lack debuginfo.
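A minimal sketch of deferred symbolification, assuming a made-up symbol table: collection stores only integers, and name lookup happens once per distinct address when the report is built. SYMBOLS, record, symbolize, and report are hypothetical names for illustration and do not mirror Memray's internals.

```python
import bisect
from typing import Dict, List, Sequence, Tuple

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x1000, "parse_header"),
    (0x1400, "decode_block"),
    (0x2000, "write_output"),
]
STARTS = [start for start, _ in SYMBOLS]

TraceLog = List[Tuple[int, Tuple[int, ...]]]


def record(trace_log: TraceLog, size: int, ips: Sequence[int]) -> None:
    """Collection time: store raw addresses only, no name lookups."""
    trace_log.append((size, tuple(ips)))


def symbolize(ip: int) -> str:
    """Report time: map an address to the symbol whose range contains it."""
    idx = bisect.bisect_right(STARTS, ip) - 1
    return SYMBOLS[idx][1] if idx >= 0 else f"<unknown 0x{ip:x}>"


def report(trace_log: TraceLog) -> Dict[Tuple[str, ...], int]:
    """Aggregate bytes per symbolized stack, resolving each address once."""
    cache: Dict[int, str] = {}
    totals: Dict[Tuple[str, ...], int] = {}
    for size, ips in trace_log:
        names = []
        for ip in ips:
            if ip not in cache:
                cache[ip] = symbolize(ip)
            names.append(cache[ip])
        key = tuple(names)
        totals[key] = totals.get(key, 0) + size
    return totals


trace_log: TraceLog = []
record(trace_log, 64, [0x2010, 0x1404])
record(trace_log, 32, [0x2010, 0x1404])
record(trace_log, 8, [0x0500])
totals = report(trace_log)
```

The hot path (record) does a single append, while the comparatively expensive lookup runs offline and is cached, which is the trade-off the paragraph above describes.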
A key architectural decision: Memray uses the Tracker context manager API (with memray.Tracker("out.bin")) internally even in CLI mode. This makes the same engine available as a Python library for fine-grained, programmatic profiling of specific code regions.
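A hedged sketch of profiling a single code region with the Tracker context manager. build_table and profile_build_table are hypothetical names; the import is guarded so that the snippet still runs, unprofiled, where memray is not installed. Tracker is given a fresh path here because it will not overwrite an existing capture file by default.

```python
import os
import tempfile

try:
    import memray  # third-party: pip install memray (Linux/macOS)
except ImportError:
    memray = None


def build_table(n):
    """Hypothetical workload whose allocations we want to capture."""
    return [str(i) * 10 for i in range(n)]


def profile_build_table(output_path, n=10_000):
    """Run build_table under memray.Tracker when memray is available."""
    if memray is None:
        return build_table(n)  # fall back to running unprofiled
    # Only allocations made inside this block are written to output_path.
    with memray.Tracker(output_path):
        return build_table(n)


with tempfile.TemporaryDirectory() as tmp:
    capture = os.path.join(tmp, "build_table.bin")
    rows = profile_build_table(capture)
    captured = os.path.exists(capture)
```

In real use the capture file would be kept rather than written to a temporary directory, and then passed to a reporter such as memray flamegraph.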
Use Cases
- Diagnosing memory leaks in production-like workloads: the flame graph reporter makes it trivial to spot functions whose allocation volume grows monotonically
- Optimizing data pipelines: identify temporary allocations (e.g., unnecessary intermediate arrays in NumPy) that cause GC pressure and throughput degradation
- CI/CD memory regression detection: integrate pytest-memray with @pytest.mark.limit_memory("50 MB") to catch regressions before they ship
- Debugging C extension memory bugs: native mode shows exactly where a CPython extension leaks, including frames inside the compiled .so
Pros & Cons
Pros: Complete, deterministic trace; native extension visibility unmatched by alternatives; fast enough for real workloads; rich visualization ecosystem; excellent pytest integration for CI.
Cons: Linux/macOS only (no Windows support); native mode overhead is higher (50-100% slowdown); symbolification requires debug symbols on the same machine; attaching to a running process carries crash risk on edge cases; no built-in heap diffing across snapshots.
Alternatives
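- tracemalloc (stdlib): lighter and always available; it traces every allocation made through Python's own allocators (it is not a sampler) but records only Python frames, so memory that native code obtains directly from malloc is invisible
- memory_profiler: line-by-line and decorator-based, but high overhead and no native frames
- Fil (pythonspeed.com): open source and focused on finding the allocations responsible for peak memory; runs on Linux and macOS, with a narrower set of reports than Memray
- Valgrind (massif): extremely thorough but 5-20x slowdown, impractical for real-time use
- py-spy + psutil: a sampling approach, good for rough trends but no allocation-level insight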
Memray occupies a unique niche: deterministic full-stack tracing with native extension support, in a package you can pip install without a PhD in tooling.
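To make the comparison with tracemalloc concrete, here is a minimal sketch using only the standard library. build_payload is a hypothetical workload. The snapshot diff attributes growth to Python source lines only, which is the limitation that native-aware tracing addresses.

```python
import tracemalloc


def build_payload():
    """Hypothetical workload: roughly 1 MiB of small byte buffers."""
    return [bytes(1024) for _ in range(1000)]


tracemalloc.start(10)  # keep up to 10 Python frames per allocation
before = tracemalloc.take_snapshot()
payload = build_payload()
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Statistics are sorted by the magnitude of the change, largest first.
top = after.compare_to(before, "lineno")[0]
print(f"{top.traceback[0].filename}:{top.traceback[0].lineno} "
      f"grew by {top.size_diff} bytes")
```

The top entry points at the line inside build_payload. An equivalent buffer allocated inside a C extension with plain malloc would not appear in this output at all.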
Who Should Use It
Python teams shipping performance-sensitive applications: ML inference services using NumPy/TensorFlow, data engineering pipelines, web services with C extensions, or anyone profiling memory in CI/CD. Requires Python 3.7+ on Linux (preferred) or macOS. System dependencies include libunwind and liblz4 (automatically handled via binary wheels on most platforms).
Getting Started
pip install memray
memray run -o output.bin my_script.py
memray flamegraph output.bin
# Writes memray-flamegraph-output.html; open that file in a browser
For native mode:
memray run --native -o output.bin my_script.py
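For CI integration, add pytest-memray, annotate tests, and run pytest with the --memray flag so the limits are enforced:
import pytest

@pytest.mark.limit_memory("50 MB")
def test_allocation_bound():
    # expensive_function and validate are placeholders for your own code
    result = expensive_function()
    assert validate(result)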