
Profiler Module

The profiler module lives at ingestion/src/metadata/profiler/. It computes statistical metrics on database tables (row counts, null ratios, histograms, etc.) and publishes the results back to the OpenMetadata server. This guide walks through the architecture top-down: from the workflow trigger, through the metrics system, down to how individual metrics are computed and published.

Directory Layout

profiler/
├── api/
│   └── models.py                    # DTOs: ProfilerResponse, ProfilerProcessorConfig
├── config.py                        # Config helpers for schema/database profiler settings
├── factory.py                       # Abstract factory for profiler interfaces
├── registry.py                      # MetricRegistry, TypeRegistry enums
├── metrics/                         # All metric definitions
│   ├── core.py                      # Base classes: Metric, StaticMetric, QueryMetric, etc.
│   ├── registry.py                  # Metrics enum — the central metric catalog
│   ├── static/                      # SQL aggregate metrics (COUNT, SUM, MIN, MAX, ...)
│   ├── composed/                    # Derived metrics (NullRatio, DuplicateCount, IQR, ...)
│   ├── window/                      # Percentile metrics (Median, Quartiles)
│   ├── hybrid/                      # Metrics combining queries + prior results (Histogram)
│   ├── system/                      # Database system metrics (DML ops, freshness)
│   └── pandas_metric_protocol.py    # Accumulator pattern for DataFrame metrics
├── interface/                       # Database abstraction layer
│   ├── profiler_interface.py        # Abstract base
│   └── sqlalchemy/                  # SQL implementations (+ dialect overrides)
├── processor/                       # Core profiling logic
│   ├── core.py                      # Profiler class — orchestrates metric computation
│   ├── processor.py                 # ProfilerProcessor — ingestion framework step
│   ├── runner.py                    # QueryRunner / PandasRunner
│   ├── metric_filter.py             # Selects metrics by config + column type
│   └── models.py                    # Internal processor models
├── source/                          # Profiler data sources
│   ├── metadata.py                  # OpenMetadataSource — fetches tables from OM API
│   ├── fetcher/                     # Entity fetching strategies
│   └── database/                    # Database-specific profiler sources
│       └── base/
│           ├── profiler_source.py   # Creates the profiler runner
│           └── profiler_resolver.py # Resolves sampler + interface by engine
├── orm/                             # ORM utilities
│   ├── registry.py                  # Type registries, dialect maps, type classifiers
│   ├── converter/                   # DB-specific type converters
│   └── functions/                   # Custom SQL functions (SumFn, LenFn, etc.)
└── adaptors/                        # Adaptors for non-SQL sources (NoSQL, etc.)

Workflow Pipeline

The profiler follows the standard Source → Processor → Sink pattern defined in workflow/profiler.py:
┌──────────────────────────────────────────────────────────┐
│  OpenMetadataSource                                      │
│  profiler/source/metadata.py                             │
│                                                          │
│  Fetches from OM API:                                    │
│  • Tables matching service/database/schema filters       │
│  • Global profiler configuration                         │
│  • Database service connection                           │
│                                                          │
│  Yields: ProfilerSourceAndEntity                         │
└──────────────────────┬───────────────────────────────────┘


┌──────────────────────────────────────────────────────────┐
│  ProfilerProcessor                                       │
│  profiler/processor/processor.py                         │
│                                                          │
│  For each table:                                         │
│  1. Create profiler runner (Profiler instance)           │
│  2. Execute all metrics                                  │
│  3. Build CreateTableProfileRequest                      │
│                                                          │
│  Yields: ProfilerResponse                                │
└──────────────────────┬───────────────────────────────────┘


┌──────────────────────────────────────────────────────────┐
│  MetadataRestSink                                        │
│  ingestion/sink/metadata_rest.py                         │
│                                                          │
│  POST /api/v1/tables/{id}/tableProfile                   │
└──────────────────────────────────────────────────────────┘
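The three stages above can be sketched as plain generators chained together. This is a minimal illustration of the data flow only: the function bodies, table names, and dict shapes are stand-ins, not the real ingestion framework step classes or OM models.

```python
def source():
    """Stand-in for OpenMetadataSource: yields tables matching the filters."""
    for table in ("sales.orders", "sales.customers"):  # illustrative names
        yield {"table": table}

def processor(entities):
    """Stand-in for ProfilerProcessor: attaches computed metrics per table."""
    for entity in entities:
        entity["profile"] = {"rowCount": 1000}  # placeholder metric results
        yield entity

def sink(responses):
    """Stand-in for MetadataRestSink: would POST each profile to
    /api/v1/tables/{id}/tableProfile; here we just collect the names."""
    published = []
    for response in responses:
        published.append(response["table"])
    return published

# Source → Processor → Sink: each stage consumes the previous stage lazily
print(sink(processor(source())))  # ['sales.orders', 'sales.customers']
```

Because each stage is a generator, tables stream through one at a time rather than being fetched and profiled in bulk.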

Metrics System

The metrics system is the heart of the profiler. Every metric is a class that knows how to compute itself against a data source.

Metric Type Hierarchy

Metric (ABC)                                # metrics/core.py

├── StaticMetric                            # SQL aggregate functions
│   fn(col, ...) → SQLAlchemy expression    # e.g., func.count(), func.avg()
│   Examples: Count, NullCount, Min, Max, Mean, Sum, StdDev,
│             DistinctCount, UniqueCount, MinLength, MaxLength

├── QueryMetric                             # Full query execution
│   query(col, ...) → SQLAlchemy Select     # Returns multiple rows
│   Examples: LikeCount, RegexCount

├── ComposedMetric                          # Derived from other metrics
│   fn(results) → value                     # Pure computation, no DB query
│   Examples: NullRatio, DistinctRatio, UniqueRatio,
│             DuplicateCount, InterQuartileRange

├── HybridMetric                            # Query + prior results
│   query(col, ...) → Select               # Needs results from earlier metrics
│   fn(results) → value
│   Examples: Histogram, CardinalityDistribution

├── WindowMetric                            # Percentile-based (window functions)
│   fn(col, ...) → SQLAlchemy expression
│   Examples: Median, FirstQuartile, ThirdQuartile

├── CustomMetric                            # User-defined SQL
│   sql(col, ...) → raw SQL string

└── SystemMetric                            # Database system-level
    sql(col, ...) → dialect-specific query
    Examples: DML operation counts, freshness
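The split between StaticMetric and ComposedMetric can be condensed into a small sketch. The class names follow metrics/core.py, but the signatures are simplified and the SQL layer is stubbed out with strings; the real `fn` on a static metric returns a SQLAlchemy expression.

```python
from abc import ABC, abstractmethod

class Metric(ABC):
    """Base: every metric has a name and knows how to produce a value."""
    @classmethod
    def name(cls):
        return cls.__name__.upper()

class StaticMetric(Metric):
    """SQL aggregate — fn() would return a SQLAlchemy expression."""
    @abstractmethod
    def fn(self):
        ...

class ComposedMetric(Metric):
    """Derived — fn(results) is pure computation over prior results."""
    @abstractmethod
    def fn(self, results):
        ...

class NullCount(StaticMetric):
    def fn(self):
        # Stand-in for func.sum(case(...)) in the real implementation
        return "SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END)"

class NullRatio(ComposedMetric):
    def fn(self, results):
        # No extra DB round trip: derived entirely from gathered results
        return results["NULLCOUNT"] / results["COUNT"]

print(NullRatio().fn({"NULLCOUNT": 25, "COUNT": 100}))  # 0.25
```

The key design point: static metrics describe *what to ask the database*, while composed metrics describe *how to combine answers already in hand*.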

Metric Registry

All built-in metrics are registered in the Metrics enum (metrics/registry.py):
class Metrics(MetricRegistry):
    ROW_COUNT = RowCount           # Table-level
    COLUMN_COUNT = ColumnCount
    COUNT = Count                  # Column-level
    NULL_COUNT = NullCount
    DISTINCT_COUNT = DistinctCount
    MEAN = Mean
    # ... 40+ metrics
The MetricRegistry base class makes each enum member callable: Metrics.ROW_COUNT() instantiates a RowCount metric.
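The callable-enum pattern behind this can be reproduced in a few lines. This is a simplified sketch, not the actual base class from profiler/registry.py: each enum member's value is a metric class, and calling the member delegates to that class's constructor.

```python
from enum import Enum

class RowCount:
    def name(self):
        return "rowCount"

class NullCount:
    def name(self):
        return "nullCount"

class MetricRegistry(Enum):
    def __call__(self, *args, **kwargs):
        # self.value is the metric class; calling the member instantiates it
        return self.value(*args, **kwargs)

class Metrics(MetricRegistry):
    ROW_COUNT = RowCount
    NULL_COUNT = NullCount

metric = Metrics.ROW_COUNT()   # instantiates a RowCount
print(metric.name())           # rowCount
```

This keeps the catalog declarative: adding a metric means adding one enum line, and callers never import the concrete metric classes directly.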

Metric Computation Order

Metrics are computed in a strict dependency order:
1. Static Metrics    (SQL aggregates — independent)
2. Query Metrics     (full queries — independent)
3. Window Metrics    (percentiles — independent)
4. System Metrics    (database system queries — independent)
5. Custom Metrics    (user SQL — independent)

        ▼ results available
6. Composed Metrics  (derived from 1-5, e.g., NullRatio = NullCount / Count)

        ▼ composed results available
7. Hybrid Metrics    (need both query + prior results, e.g., Histogram)
Steps 1–5 run in parallel via a thread pool. Steps 6–7 run sequentially per column after the parallel phase completes.
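The two-phase ordering can be sketched with a thread pool: independent metrics fan out in parallel, then composed metrics run as pure computations over the gathered results. The metric functions here are stand-ins returning fixed values in place of real SQL round trips.

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1 stand-ins: each would normally issue an aggregate query
def count():          return 100
def null_count():     return 25
def distinct_count(): return 80

static_metrics = {
    "count": count,
    "nullCount": null_count,
    "distinctCount": distinct_count,
}

# Phase 1: independent metrics run concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {name: pool.submit(fn) for name, fn in static_metrics.items()}
    results = {name: f.result() for name, f in futures.items()}

# Phase 2: composed metrics — derived from phase-1 results, no DB access
results["nullRatio"] = results["nullCount"] / results["count"]
results["distinctRatio"] = results["distinctCount"] / results["count"]

print(results["nullRatio"], results["distinctRatio"])  # 0.25 0.8
```

The dependency order falls out naturally: phase 2 cannot start until every future from phase 1 has resolved, which is exactly the barrier the profiler enforces between steps 1–5 and steps 6–7.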