2026.01.16-bloomberg-memray

Memray: Full-Stack Memory Profiling for Python and Native Extensions

Overview

Memray is Bloomberg's open-source memory profiler. It tracks every allocation across the entire call stack: Python code, C/C++ native extensions, and the CPython interpreter itself. Unlike sampling profilers, which estimate hotspots from periodic snapshots, Memray hooks into every malloc, realloc, and free call, producing a complete, deterministic record of memory behavior. For teams building data pipelines, ML inference services, or performance-sensitive Python applications, it closes a critical observability gap by showing both what is using memory and exactly which code path allocated it.

Key Features

Full call-stack tracing — captures every allocation's complete trace, not a statistical sample, giving deterministic accuracy
Native C/C++/Rust extension awareness — using --native, it visualizes allocations inside libraries like NumPy, pandas, or custom CPython modules alongside Python frames, rendered in distinct colors in flame graphs
Live TUI mode — attach to a running process and inspect allocation patterns in real time with keyboard-navigable, sortable views per thread
Multi-reporter output — flame graphs (HTML), tree views, summary tables, stats, and a pytest plugin (pytest-memray) that can fail tests exceeding configurable memory limits
Process attachment — inject the tracker into an already-running Python process (PID-based) without restarting the application, using GDB/LLDB under the hood

Technical Architecture

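Memray works by intercepting allocator calls inside the profiled process. It does not depend on LD_PRELOAD (Linux) or DYLD_INSERT_LIBRARIES (macOS), which would have to be set before the process starts. Instead, once tracking begins, it patches the symbol relocation entries of the shared objects already loaded, so that calls to malloc, calloc, realloc, posix_memalign, free, and related functions such as mmap are routed through its own wrappers. Each intercepted call records the allocation size and address together with the current Python call stack. With --native, it additionally unwinds the native stack, using libunwind on Linux and backtrace() on macOS.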
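To make the value of a complete record concrete, here is a minimal sketch of how a stream of intercepted calls could be reduced to a peak figure and a list of unfreed blocks. The Event layout and the replay function are hypothetical illustrations, not Memray's file format or internals, and realloc is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Event:
    """One intercepted allocator call (hypothetical record layout)."""
    kind: str                     # "alloc" or "free"
    address: int                  # pointer returned by, or passed to, the allocator
    size: int = 0                 # bytes requested; unused for "free"
    stack: Tuple[str, ...] = ()   # call stack at the time of the call


def replay(events: Iterable[Event]) -> Tuple[int, Dict[int, Event]]:
    """Return (peak live bytes, allocations that were never freed)."""
    live: Dict[int, Event] = {}
    current = peak = 0
    for ev in events:
        if ev.kind == "alloc":
            live[ev.address] = ev
            current += ev.size
            peak = max(peak, current)
        elif ev.kind == "free":
            freed = live.pop(ev.address, None)  # ignore frees of untracked blocks
            if freed is not None:
                current -= freed.size
    return peak, live


events = [
    Event("alloc", 0x1000, 64, ("main", "load")),
    Event("alloc", 0x2000, 256, ("main", "parse")),
    Event("free", 0x1000),
    Event("alloc", 0x3000, 32, ("main", "cache")),
    Event("free", 0x2000),
]
peak, leaked = replay(events)
print(peak)                       # 320
print([hex(a) for a in leaked])   # ['0x3000']
```

Because every event is present, both numbers are exact. A sampling profiler that missed the second allocation would under-report the peak.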

The stack frames are stored as raw instruction pointers — symbolification is deliberately deferred to report-generation time, not collection time. This minimizes profiling overhead (typically 10-30% slowdown for Python-only, higher with --native). At analysis time, Memray converts IPs to human-readable symbols using DWARF debug info when available, falling back to ELF symbol tables. It also integrates with debuginfod to fetch debug symbols from remote servers, which is essential when distribution packages lack debuginfo.
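The idea of deferring symbol resolution can be illustrated with a pure-Python analogy. This is a toy sketch, not Memray's implementation: Memray stores native instruction pointers, whereas this example stores references to Python code objects. The helper names capture_raw_stack and symbolize are invented for the illustration.

```python
import sys
from collections import Counter
from types import CodeType
from typing import List, Tuple

RawFrame = Tuple[CodeType, int]  # (code object, line number): cheap to capture


def capture_raw_stack() -> Tuple[RawFrame, ...]:
    """Collection time: walk the frames and store references, format nothing."""
    frame = sys._getframe(1)  # skip this helper's own frame
    raw: List[RawFrame] = []
    while frame is not None:
        raw.append((frame.f_code, frame.f_lineno))
        frame = frame.f_back
    return tuple(raw)


def symbolize(raw: Tuple[RawFrame, ...]) -> List[str]:
    """Report time: turn raw entries into readable strings."""
    return [f"{code.co_name} ({code.co_filename}:{line})" for code, line in raw]


samples: Counter = Counter()


def inner() -> None:
    samples[capture_raw_stack()] += 1  # hot path: no string building


def outer() -> None:
    for _ in range(3):
        inner()


outer()
for raw, hits in samples.items():
    # Strings are built once per distinct stack, not once per event.
    print(hits, symbolize(raw)[:2])
```

The three calls share one stack, so the expensive formatting step runs once at report time instead of three times during collection.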

A key architectural decision: Memray uses the Tracker context manager API (with memray.Tracker("out.bin")) internally even in CLI mode. This makes the same engine available as a Python library for fine-grained, programmatic profiling of specific code regions.
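A minimal sketch of region-level profiling with the Tracker context manager follows. The workload build_table and the wrapper profile_region are hypothetical names chosen for the example. It assumes memray is installed and simply returns None when it is not. A fresh temporary directory is used because Tracker will not overwrite an existing capture file.

```python
import tempfile
from pathlib import Path

try:
    import memray  # third-party: pip install memray (Linux/macOS only)
except ImportError:
    memray = None


def build_table(n: int) -> list:
    """Hypothetical workload: allocate n small byte strings."""
    return [bytes(64) for _ in range(n)]


def profile_region(n: int = 10_000):
    """Trace only the build_table call; return the capture path, or None without memray."""
    if memray is None:
        return None
    out_dir = Path(tempfile.mkdtemp(prefix="memray-demo-"))
    capture = out_dir / "region.bin"  # must not exist yet
    with memray.Tracker(str(capture)):
        build_table(n)                # only allocations made here are recorded
    return capture


capture = profile_region()
print(capture)  # pass this path to `memray flamegraph` to render a report
```

Wrapping a single call this way keeps the capture small and focused, which helps when the surrounding application allocates heavily during startup.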

Use Cases

Diagnosing memory leaks in production-like workloads — the flame graph reporter makes it trivial to spot functions whose allocation volume grows monotonically
Optimizing data pipelines — identify temporary allocations (e.g., unnecessary intermediate arrays in NumPy) that cause GC pressure and throughput degradation
CI/CD memory regression detection — integrate pytest-memray with @pytest.mark.limit_memory("50 MB") to catch regressions before they ship
Debugging C extension memory bugs — native mode shows exactly where a CPython extension leaks, including frames inside the compiled .so

Pros & Cons

Pros: Complete, deterministic trace; native extension visibility unmatched by alternatives; fast enough for real workloads; rich visualization ecosystem; excellent pytest integration for CI.

Cons: Linux/macOS only (no Windows support); native mode overhead is higher (50-100% slowdown); symbolification requires debug symbols on the same machine; attaching to a running process carries crash risk on edge cases; no built-in heap diffing across snapshots.

Alternatives

tracemalloc (stdlib): built in and exact rather than sampled, but it only sees allocations made through Python's own allocator APIs, so memory that native code obtains directly from malloc is invisible
memory_profiler: line-by-line decorator-based, but high overhead and no native frames
Fil (pythonspeed.com): open source and deterministic, focused on finding the source of peak memory usage; runs on Linux and macOS
Valgrind (massif): extremely thorough but 5-20x slowdown, impractical for real-time use
py-spy + psutil: sampling approach, good for rough trends but no allocation-level insight
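For comparison, the standard-library option needs only a few lines. The sketch below uses tracemalloc to compare the peak Python-level allocation of two hypothetical functions. The helper peak_bytes is an illustration rather than part of any library, and allocations made directly by native code would not show up in these figures.

```python
import tracemalloc


def peak_bytes(fn) -> int:
    """Run fn under tracemalloc and return the peak traced size in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def with_intermediate_list() -> int:
    squares = [i * i for i in range(200_000)]  # whole list is alive at once
    return sum(squares)


def streaming() -> int:
    return sum(i * i for i in range(200_000))  # values are produced one at a time


eager = peak_bytes(with_intermediate_list)
lazy = peak_bytes(streaming)
print(eager > lazy)
```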

Memray occupies a unique niche: deterministic full-stack tracing with native extension support, in a package you can pip install without a PhD in tooling.

Who Should Use It

Python teams shipping performance-sensitive applications: ML inference services using NumPy/TensorFlow, data engineering pipelines, web services with C extensions, or anyone profiling memory in CI/CD. Requires Python 3.7+ on Linux (preferred) or macOS. System dependencies include libunwind and liblz4 (automatically handled via binary wheels on most platforms).

Getting Started

pip install memray
memray run -o output.bin my_script.py
memray flamegraph output.bin
# Writes memray-flamegraph-output.html; open it in a browser

For native mode:

memray run --native -o output.bin my_script.py

For CI integration, add pytest-memray and annotate tests:

import pytest

@pytest.mark.limit_memory("50 MB")
def test_allocation_bound():
    # expensive_function and validate stand in for your own code
    result = expensive_function()
    assert validate(result)