Data Processing Workflows
Patterns and best practices to handle large datasets and streaming data reliably, efficiently, and cost-effectively.
Large-scale data systems blend batch and stream processing, robust storage, and strong operational controls. This guide covers common architectures, ingestion patterns, fault tolerance, schema management, and monitoring.
Batch processing
Good for large, periodic workloads (ETL, nightly aggregates). Key traits: high throughput, tolerant of higher latency, easier to debug and re-run.
Stream processing
Required for low-latency use cases (real-time analytics, live recommendations). Key traits: continuous ingestion, event-time handling, and a deliberate choice between at-least-once and exactly-once delivery semantics.
- Use managed ingestion where possible (Kafka, Kinesis, Pub/Sub, or cloud-managed connectors) to simplify scale.
- Prefer durable append-only logs as the ingestion abstraction — they enable replay, windowing and backfills.
- Support multiple connectors: webhooks → ingestion buffer → durable log; database CDC → connector → topic; file drops → object store events.
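For example, a minimal webhook ingest handler can validate the payload and append it to a Kafka topic acting as the durable log. This is a sketch using confluent-kafka; the broker address, topic name, and payload fields are illustrative:

```python
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # illustrative broker address

def ingest_webhook(payload: dict) -> None:
    """Validate an incoming webhook payload and append it to the durable 'events' log."""
    if "event_time" not in payload:
        raise ValueError("missing event_time")
    producer.produce(
        "events",                               # illustrative topic name
        key=str(payload.get("tenant_id", "")),  # stable key keeps one tenant's events ordered
        value=json.dumps(payload).encode("utf-8"),
    )
    producer.flush()  # sketch only: in production, flush periodically rather than per event
```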
Choose the right tool for the job: Spark/Flink/Beam for large-scale unified batch/stream processing, or lightweight stream processors (Kafka Streams, ksqlDB) for simpler topologies. Use workflow orchestration (Airflow, Temporal) for complex pipelines.
- Manage dependencies, retries, datasets, and backfills.
- Support event-time processing, watermarks, and windowing for correctness.
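As a concrete illustration of event-time windowing, the following Apache Beam (Python) sketch assigns event-time timestamps and counts events per user in one-minute fixed windows. The in-memory events and field layout are hypothetical; a real pipeline would read from the durable log instead:

```python
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Hypothetical events as (user_id, event_time_seconds); a real pipeline would
# read these from the durable log (e.g. Kafka) instead of an in-memory list.
events = [("u1", 10.0), ("u2", 65.0), ("u1", 70.0)]

with beam.Pipeline() as p:
    (
        p
        | "CreateEvents" >> beam.Create(events)
        # Attach event-time timestamps so windowing uses event time, not arrival time.
        | "AssignTimestamps" >> beam.Map(lambda e: TimestampedValue(e, e[1]))
        # One-minute fixed windows; the watermark decides when a window is complete.
        | "Window" >> beam.WindowInto(
            FixedWindows(60),
            trigger=AfterWatermark(),
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "OnePerEvent" >> beam.Map(lambda e: (e[0], 1))
        | "CountPerUserWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```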
- Use append-optimized storage for raw events (object store + partitioned prefixes, or log-based storage).
- Partition by time and a stable key (date=/hour= + tenant/product) to enable efficient scans and TTLs (see the sketch after this list).
- Design schema evolution (versioning fields, nullable adds) so consumers can handle changes safely.
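As a sketch of the partition layout described above (the bucket name and field names are illustrative):

```python
from datetime import datetime, timezone

def partition_prefix(event: dict, base: str = "s3://raw-events") -> str:
    """Build a partitioned object-store prefix: date=/hour=/tenant=."""
    ts = datetime.fromtimestamp(event["event_time"], tz=timezone.utc)
    return f"{base}/date={ts:%Y-%m-%d}/hour={ts:%H}/tenant={event['tenant_id']}/"

# s3://raw-events/date=2023-11-14/hour=22/tenant=acme/
print(partition_prefix({"event_time": 1700000000, "tenant_id": "acme"}))
```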
ELT (extract, load, transform) is often preferable: store raw events and run transforms downstream. Keep raw data immutable and reproducible for reprocessing.
- Save raw events with metadata (ingest_timestamp, source, partition) for lineage.
- Use incremental transforms with bookmark/state to avoid full reprocessing where possible.
- Provide idempotent transform jobs to safely retry or resume after failures.
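A sketch of an incremental, idempotent transform driven by a stored bookmark. Here read_events and write_partition are assumed helpers (an incremental warehouse query and a partition-overwriting writer), and the bookmark location is illustrative:

```python
import json
from pathlib import Path

BOOKMARK = Path("state/daily_transform_bookmark.json")  # illustrative state location

def load_bookmark() -> str:
    """Return the last processed watermark (ISO timestamp), or an epoch default."""
    if BOOKMARK.exists():
        return json.loads(BOOKMARK.read_text())["last_processed"]
    return "1970-01-01T00:00:00Z"

def save_bookmark(watermark: str) -> None:
    BOOKMARK.parent.mkdir(parents=True, exist_ok=True)
    BOOKMARK.write_text(json.dumps({"last_processed": watermark}))

def run_incremental_transform(read_events, write_partition) -> None:
    """Process only events newer than the bookmark; the writer overwrites whole
    partitions, so re-running after a failure does not duplicate output."""
    since = load_bookmark()
    events = read_events(since=since)  # assumed incremental reader
    if not events:
        return
    write_partition(events)            # assumed idempotent (overwrite) writer
    save_bookmark(max(e["ingest_timestamp"] for e in events))
```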
Design for retries: include unique event IDs, producer sequence numbers, or durable logs so consumers can deduplicate. At-least-once delivery is common; build dedup logic or use exactly-once capable frameworks when necessary.
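A minimal in-memory deduplication sketch keyed on a unique event ID; a production consumer would typically keep this state in a keyed state store or rely on the sink's merge keys instead:

```python
from collections import OrderedDict

class Deduplicator:
    """Tracks recently seen event IDs within a bounded window."""

    def __init__(self, max_ids: int = 100_000):
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max_ids = max_ids

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            return True
        self._seen[event_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest ID
        return False

dedup = Deduplicator()
for event in [{"id": "e1"}, {"id": "e1"}, {"id": "e2"}]:
    if not dedup.is_duplicate(event["id"]):
        print("processing", event["id"])  # e1 and e2 are processed once each
```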
- Keep raw data immutable so you can replay and reprocess after bug fixes or schema changes.
- Provide replay windows or versioned datasets; orchestration should support targeted backfills.
- Test backfills on a sample subset before running across full history to validate correctness and cost.
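A sketch of a targeted, sample-first backfill driver; run_partition is an assumed idempotent job entry point:

```python
from datetime import date, timedelta

def backfill(run_partition, start: date, end: date, sample_only: bool = True) -> None:
    """Re-run a transform over a date range. With sample_only=True only the first
    day is processed, to validate correctness and cost before the full history."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    for day in (days[:1] if sample_only else days):
        run_partition(partition=day.isoformat())  # assumed idempotent job entry point

# Validate on a single day first, then re-run with sample_only=False:
# backfill(run_partition, date(2024, 1, 1), date(2024, 3, 31))
```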
Instrument per-stage metrics: ingestion lag, processing throughput, success/error rates, downstream consumer lag, and data quality checks. Alert on SLA breaches and anomalous drops in volume.
- Ingest lag (time between an event's timestamp and when it becomes available to consumers)
- Processing errors and poison message rate
- Consumer lag (committed offset behind head)
- Data quality failures (schema mismatch, null rates)
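A minimal instrumentation sketch for a few of these signals using prometheus_client; the metric names and the quality check are illustrative:

```python
import time

from prometheus_client import Counter, Gauge

INGEST_LAG = Gauge("pipeline_ingest_lag_seconds", "Event time to availability lag")
PROCESSING_ERRORS = Counter("pipeline_processing_errors_total", "Records that failed processing")
QUALITY_FAILURES = Counter("pipeline_quality_failures_total", "Schema or null-rate check failures")

def observe_event(event: dict) -> None:
    # Ingest lag: wall-clock now minus the event's own timestamp.
    INGEST_LAG.set(time.time() - event["event_time"])
    if "schema_version" not in event:  # illustrative data quality check
        QUALITY_FAILURES.inc()

def record_processing_error() -> None:
    PROCESSING_ERRORS.inc()
```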
// Ingest webhook -> append to durable log
POST /ingest/webhook -> validate -> appendToLog(topic="events")
// Stream processor (Flink/Beam pseudo)
stream = readFromLog("events")
.assignTimestamps(event => event.event_time)
.windowBy(event_time, sliding=1m)
.map(enrichEvent)
.sinkTo(bigquery_or_data_warehouse)
// Batch job (Airflow)
airflow dag daily_aggregates:
- extract last 24h from warehouse
- aggregate metrics
  - write aggregates to analytics table

These patterns emphasize durable logs, event timestamps, and separation of raw vs transformed data.
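For the batch job sketched above, a fuller Airflow DAG might look like the following (a sketch assuming Airflow 2.4+; the task bodies are placeholders for the warehouse extract, aggregation, and load steps):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables for the extract / aggregate / load steps.
def extract_last_24h():
    ...

def aggregate_metrics():
    ...

def write_aggregates():
    ...

with DAG(
    dag_id="daily_aggregates",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract_last_24h", python_callable=extract_last_24h)
    aggregate = PythonOperator(task_id="aggregate_metrics", python_callable=aggregate_metrics)
    load = PythonOperator(task_id="write_aggregates", python_callable=write_aggregates)

    extract >> aggregate >> load
```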
- Apply backpressure at ingestion if downstream systems are overloaded (buffer + shed or throttle producers).
- Use adaptive batching and circuit-breakers to protect processing clusters from spikes.
- Expose rate-limit headers for API producers so they can back off gracefully.
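A minimal token-bucket throttle for the ingestion endpoint: when the bucket is empty the handler returns 429 with a Retry-After hint so producers can back off. The rate, capacity, and handler shape are illustrative:

```python
import time

class TokenBucket:
    """Refills tokens at a fixed rate; each accepted request consumes one token."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, capacity=200)

def handle_ingest(_request) -> tuple[int, dict]:
    if not bucket.allow():
        return 429, {"Retry-After": "1"}  # tell producers to back off
    return 202, {}                        # accepted for asynchronous processing
```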
Balance storage, compute and latency. Keep raw data (cold) in cheap object storage while serving hot aggregates from a fast store. Use autoscaling and spot instances for cost savings where acceptable.
- Batch larger windows to reduce per-record overhead at the cost of latency (see the sketch after this list).
- Cache aggregated results to avoid repetitive heavy computations.
- Use cheaper compute for best-effort or non-real-time workloads.
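As a sketch of the "batch larger windows" trade-off referenced above, a micro-batcher that flushes on size or age; flush_fn is an assumed bulk writer:

```python
import time

class MicroBatcher:
    """Accumulates records and flushes when the batch is full or old enough;
    larger batches reduce per-record overhead at the cost of added latency."""

    def __init__(self, flush_fn, max_size: int = 500, max_delay_s: float = 5.0):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self.max_delay_s = max_delay_s
        self.buffer: list = []
        self.first_added = 0.0

    def add(self, record) -> None:
        if not self.buffer:
            self.first_added = time.monotonic()
        self.buffer.append(record)
        too_old = time.monotonic() - self.first_added >= self.max_delay_s
        if len(self.buffer) >= self.max_size or too_old:  # age is only checked on add in this sketch
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

batcher = MicroBatcher(flush_fn=lambda records: print(f"writing {len(records)} records"))
for i in range(1200):
    batcher.add({"id": i})
batcher.flush()  # flush the remaining tail at shutdown
```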
- Store raw events immutably and include ingest metadata for lineage.
- Implement idempotency keys or deduplication for at-least-once systems.
- Maintain a schema registry and changelog; require consumers to opt in to breaking changes.
- Set SLAs for processing (p95 latency) and alert on lag or error spikes.
- Test backfills on sample data and document rollback procedures.