Talend Big Data enables processing massive datasets using Spark and Hadoop. This guide covers architecture, job design, and optimization for enterprise big data workloads.
Big Data Architecture Overview
                     TALEND BIG DATA ARCHITECTURE

  TALEND STUDIO                         EXECUTION ENVIRONMENTS
  ┌──────────────────────┐              ┌────────────────────────────┐
  │  Big Data Job        │              │  Spark Cluster             │
  │  Designer            │─────────────►│   • Standalone             │
  │                      │              │   • YARN                   │
  │  ┌────────────────┐  │              │   • Kubernetes             │
  │  │ Spark          │  │              │   • Databricks             │
  │  │ Components     │  │              │   • EMR/Dataproc           │
  │  └────────────────┘  │              └────────────────────────────┘
  │                      │
  │  ┌────────────────┐  │              ┌────────────────────────────┐
  │  │ Hadoop         │  │─────────────►│  Hadoop Cluster            │
  │  │ Components     │  │              │   • HDFS Storage           │
  │  └────────────────┘  │              │   • YARN Resource Mgmt     │
  │                      │              │   • Hive Metastore         │
  │  ┌────────────────┐  │              └────────────────────────────┘
  │  │ Cloud Data     │  │
  │  │ Lake           │  │              ┌────────────────────────────┐
  │  └────────────────┘  │─────────────►│  Cloud Data Lakes          │
  └──────────────────────┘              │   • S3/Delta Lake          │
                                        │   • Azure Data Lake        │
                                        │   • Google Cloud Storage   │
                                        └────────────────────────────┘
Spark Job Configuration
Spark Connection Setup
/*
* tSparkConfiguration Component Settings:
*
* Spark Mode Options:
* - Local: local[*] (for development)
* - Standalone: spark://master:7077
 * - YARN: yarn (with deploy mode client or cluster; the old yarn-client/yarn-cluster masters are deprecated)
* - Kubernetes: k8s://https://kubernetes-api:6443
*/
// tSparkConfiguration - YARN Mode
/*
* Basic Settings:
* - Spark Version: 3.x
* - Spark Mode: YARN
* - Deployment Mode: Cluster (production) / Client (development)
*
* YARN Settings:
* - Resource Manager: yarn-rm.cluster.local:8032
* - HDFS namenode: hdfs://namenode:8020
*
* Spark Properties:
* - spark.executor.instances: 10
* - spark.executor.memory: 8g
* - spark.executor.cores: 4
* - spark.driver.memory: 4g
* - spark.dynamicAllocation.enabled: true
* - spark.shuffle.service.enabled: true
*
* Additional Spark Properties:
* - spark.sql.shuffle.partitions: 200
* - spark.default.parallelism: 200
* - spark.sql.adaptive.enabled: true
* - spark.serializer: org.apache.spark.serializer.KryoSerializer
*/
// Context variables for environment-specific configuration
// DEV context:
context.spark_master = "local[4]"
context.spark_executor_memory = "2g"
context.spark_executor_instances = "2"
// PROD context:
context.spark_master = "yarn"
context.spark_executor_memory = "16g"
context.spark_executor_instances = "50"
Spark Batch Job Structure
/*
* Typical Spark Batch Job Structure:
*
 * tSparkConfiguration
 *        │
 *        ▼
 * tHDFSConfiguration (optional)
 *        │
 *        ▼
 * tFileInputDelimited (Spark) ──► tMap (Spark) ──► tFilterRow (Spark)
 *        │                                            │
 *        │                                            ▼
 *        │                                        tAggregateRow (Spark)
 *        │                                            │
 *        │                                            ▼
 *        │                                        tFileOutputParquet (Spark)
 *        │
 *        ▼
 * tSparkClose (optional - for explicit shutdown)
*/
// The job generates Spark code conceptually equivalent to:
/*
* val spark = SparkSession.builder()
* .appName("TalendSparkJob")
* .master(sparkMaster)
* .config("spark.executor.memory", executorMemory)
* .getOrCreate()
*
* val inputDF = spark.read
* .option("header", "true")
* .option("inferSchema", "true")
* .csv(inputPath)
*
* val transformedDF = inputDF
* .filter(col("status") === "ACTIVE")
* .groupBy("category")
* .agg(sum("amount").as("total_amount"))
*
* transformedDF.write
* .mode("overwrite")
* .parquet(outputPath)
 */
HDFS Operations
HDFS Connection Configuration
/*
* tHDFSConfiguration Settings:
*
* Distribution: Cloudera/Hortonworks/MapR/Custom
* Version: Specific distribution version
*
* NameNode URI: hdfs://namenode.cluster.local:8020
* Username: hdfs_user
*
* Kerberos Authentication (if enabled):
* - Use Kerberos: true
* - Principal: talend@REALM.COM
* - Keytab: /path/to/talend.keytab
*
* High Availability Configuration:
* - NameNode HA: true
* - NameService: mycluster
* - NameNodes: nn1,nn2
* - nn1: namenode1.cluster.local:8020
* - nn2: namenode2.cluster.local:8020
*/
// Context-based HDFS configuration
// In context variables:
context.hdfs_namenode = "hdfs://namenode:8020"
context.hdfs_user = "talend_etl"
context.hdfs_data_path = "/data/raw"
context.hdfs_output_path = "/data/processed"
HDFS File Operations
/*
* tHDFSInput - Read files from HDFS
*
* Settings:
* - HDFS Path: context.hdfs_data_path + "/sales/*.csv"
* - File Type: CSV
* - Header: true
* - Encoding: UTF-8
*/
/*
* tHDFSOutput - Write files to HDFS
*
* Settings:
* - HDFS Path: context.hdfs_output_path + "/sales_processed"
* - File Type: Parquet (recommended for analytics)
* - Overwrite: true
* - Compression: SNAPPY
*/
/*
* tHDFSDelete - Clean up files
*
* Settings:
* - HDFS Path: context.hdfs_output_path + "/_temporary"
* - Recursive: true
*/
/*
* tHDFSList - List directory contents
*
* Settings:
* - HDFS Path: context.hdfs_data_path
* - Recursive: true
* - Output: File path, size, modification time
*/
/*
* tHDFSCopy - Move files within HDFS
*
* Settings:
* - Source: context.hdfs_data_path + "/landing/*.csv"
* - Destination: context.hdfs_data_path + "/archive/"
* - Move (delete source): true
 */
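These components correspond closely to the Hadoop FileSystem API. A rough orientation sketch (not Talend-generated code; the NameNode URI and paths mirror the context values above) of the equivalent calls:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: connect to the cluster named in context.hdfs_namenode
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:8020")
val fs = FileSystem.get(conf)

// tHDFSList equivalent: path, size, and modification time per entry
fs.listStatus(new Path("/data/raw")).foreach { status =>
  println(s"${status.getPath}  ${status.getLen}  ${status.getModificationTime}")
}

// tHDFSDelete equivalent: remove a temporary directory recursively
fs.delete(new Path("/data/processed/_temporary"), true)

// tHDFSCopy with "Move" enabled: a rename within the same cluster
fs.rename(new Path("/data/raw/landing/sales.csv"), new Path("/data/raw/archive/sales.csv"))

fs.close()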
Hive Integration
Hive Connection and Queries
/*
* tHiveConnection Settings:
*
* Distribution: Cloudera/Hortonworks
* Hive Version: 3.x
*
* Connection Mode: HiveServer2
* Host: hiveserver2.cluster.local
* Port: 10000
* Database: analytics_db
* Username: hive_user
*
* Authentication:
* - None (development)
* - Kerberos (production)
*
* Additional JDBC Parameters:
* - transportMode=binary
* - ssl=true (if enabled)
*/
/*
* tHiveInput - Read from Hive table
*
* Settings:
* - Query: SELECT * FROM sales_fact WHERE sale_date >= '2024-01-01'
* - OR Table Name: analytics_db.sales_fact
*
* For Partitioned Tables:
 * - Query: SELECT * FROM sales_fact WHERE year=2024 AND month=1
*/
/*
* tHiveOutput - Write to Hive table
*
* Settings:
* - Table: analytics_db.sales_summary
* - Action: Create table if not exists / Truncate and insert / Append
*
* Partitioning:
* - Partition columns: year, month
* - Dynamic partitioning: true
*
* Table Properties:
* - File format: Parquet
* - Compression: Snappy
*/
/*
* tHiveRow - Execute Hive DDL/DML
*
* Example - Create partitioned table:
*/
String createTableSQL = "CREATE TABLE IF NOT EXISTS analytics_db.daily_metrics (" +
"metric_name STRING, " +
"metric_value DOUBLE, " +
"record_count BIGINT, " +
"created_at TIMESTAMP" +
") " +
"PARTITIONED BY (process_date STRING) " +
"STORED AS PARQUET " +
"TBLPROPERTIES ('parquet.compression'='SNAPPY')";
/*
* Example - Add partition:
*/
String addPartitionSQL = "ALTER TABLE analytics_db.daily_metrics " +
"ADD IF NOT EXISTS PARTITION (process_date='" + context.process_date + "')";
/*
* Example - MSCK repair (discover partitions):
*/
String repairSQL = "MSCK REPAIR TABLE analytics_db.daily_metrics";
Spark with Hive Integration
/*
* Spark Job reading/writing Hive tables
*
* tSparkConfiguration Settings:
* - Enable Hive: true
* - Hive Metastore URI: thrift://hive-metastore:9083
*
* OR use warehouse directory:
* - spark.sql.warehouse.dir: hdfs://namenode:8020/user/hive/warehouse
*/
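With Hive support enabled as above, the DDL strings built for tHiveRow earlier can just as well be executed from within the Spark job itself. A minimal sketch (reusing the createTableSQL, addPartitionSQL, and repairSQL strings from the previous section):

// Sketch: run the Hive DDL through the Hive-enabled SparkSession
spark.sql(createTableSQL)
spark.sql(addPartitionSQL)
spark.sql(repairSQL)   // MSCK REPAIR discovers partitions written outside the metastore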
/*
* tHiveInput (Spark mode) - Read Hive table as DataFrame
*
* Query: SELECT * FROM analytics_db.transactions WHERE status = 'COMPLETED'
*
* Generated Spark code:
* val transactionsDF = spark.sql(
* "SELECT * FROM analytics_db.transactions WHERE status = 'COMPLETED'"
* )
*/
/*
* tHiveOutput (Spark mode) - Write DataFrame to Hive
*
* Settings:
* - Database: analytics_db
* - Table: transaction_summary
* - Mode: Overwrite / Append
* - Partition By: year, month
*
* Generated Spark code:
* summaryDF.write
* .mode("overwrite")
* .partitionBy("year", "month")
* .format("hive")
* .saveAsTable("analytics_db.transaction_summary")
 */
Cloud Data Lake Integration
AWS S3 with Spark
/*
* tS3Configuration for Spark
*
* Settings:
* - Access Key: context.aws_access_key
* - Secret Key: context.aws_secret_key
* - Region: us-east-1
*
* OR use IAM Role (recommended for EMR):
* - Inherit credentials from role: true
*
* Spark Configuration Properties:
* - spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
* - spark.hadoop.fs.s3a.aws.credentials.provider:
* com.amazonaws.auth.InstanceProfileCredentialsProvider
*/
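A minimal sketch of how these S3A properties land on the Spark session when configuring it by hand (the app name and the choice of IAM-role credentials provider are assumptions; a tS3Configuration component sets the equivalent properties for you):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-read-sketch")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider")   // IAM role, e.g. on EMR
  .getOrCreate()

val salesDF = spark.read
  .option("header", "true")
  .csv("s3a://my-data-lake/raw/sales/year=2024/month=01/")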
/*
* tFileInputDelimited (Spark) - Read from S3
*
* Path: s3a://my-data-lake/raw/sales/year=2024/month=01/*.csv
*
* For Delta Lake format:
* - spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
* - spark.sql.catalog.spark_catalog:
* org.apache.spark.sql.delta.catalog.DeltaCatalog
*/
// tJava - Read Delta Lake table
/*
* val deltaDF = spark.read
* .format("delta")
* .load("s3a://my-data-lake/bronze/transactions")
*
* // Time travel - read previous version
* val historicalDF = spark.read
* .format("delta")
* .option("versionAsOf", 5)
* .load("s3a://my-data-lake/bronze/transactions")
*/
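Writing back to Delta is not shown above; a minimal sketch (assumes the delta-spark library is on the classpath, and transformedDF and the target path are illustrative):

// Sketch: write a processed DataFrame to a partitioned Delta table on S3
transformedDF.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month")
  .save("s3a://my-data-lake/silver/transactions")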
/*
* tFileOutputParquet (Spark) - Write to S3
*
* Path: s3a://my-data-lake/silver/sales_processed/
* Mode: Overwrite
* Partition By: year, month, day
* Compression: SNAPPY
 */
Azure Data Lake with Spark
/*
* Azure Data Lake Gen2 Configuration
*
* Spark Properties:
* - spark.hadoop.fs.azure.account.auth.type.<storage>.dfs.core.windows.net:
* OAuth
* - spark.hadoop.fs.azure.account.oauth.provider.type.<storage>.dfs.core.windows.net:
* org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
* - spark.hadoop.fs.azure.account.oauth2.client.id.<storage>.dfs.core.windows.net:
* context.azure_client_id
* - spark.hadoop.fs.azure.account.oauth2.client.secret.<storage>.dfs.core.windows.net:
* context.azure_client_secret
* - spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage>.dfs.core.windows.net:
* https://login.microsoftonline.com/<tenant-id>/oauth2/token
*/
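A rough sketch of applying the same ADLS Gen2 settings programmatically (the storage account "mydatalake", the credential environment variables, and the <tenant-id> placeholder are illustrative):

// Sketch: set the ABFS OAuth properties on the Hadoop configuration before reading
val account = "mydatalake.dfs.core.windows.net"
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set(s"fs.azure.account.auth.type.$account", "OAuth")
hadoopConf.set(s"fs.azure.account.oauth.provider.type.$account",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hadoopConf.set(s"fs.azure.account.oauth2.client.id.$account", sys.env("AZURE_CLIENT_ID"))
hadoopConf.set(s"fs.azure.account.oauth2.client.secret.$account", sys.env("AZURE_CLIENT_SECRET"))
hadoopConf.set(s"fs.azure.account.oauth2.client.endpoint.$account",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

val rawDF = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/")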
/*
* tFileInputDelimited (Spark) - Read from ADLS Gen2
*
* Path: abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/*.parquet
*/
/*
* tFileOutputParquet (Spark) - Write to ADLS Gen2
*
* Path: abfss://processed@mydatalake.dfs.core.windows.net/sales_silver/
 */
Google Cloud Storage with Spark
/*
* GCS Configuration for Spark
*
* Spark Properties:
* - spark.hadoop.google.cloud.auth.service.account.enable: true
* - spark.hadoop.google.cloud.auth.service.account.json.keyfile:
* /path/to/service-account.json
*
 * OR on GKE/Dataproc, rely on the default service account credentials
 * (omit the keyfile property); the connector itself is registered via:
 * - spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
*/
/*
* tFileInputDelimited (Spark) - Read from GCS
*
* Path: gs://my-data-bucket/raw/events/date=2024-01-15/*.json
*/
/*
* tFileOutputParquet (Spark) - Write to GCS
*
* Path: gs://my-data-bucket/processed/events/
* Partition By: event_date
 */
Performance Optimization
Spark Job Tuning
/*
* Key Spark Performance Settings
*/
// Parallelism Configuration
// tSparkConfiguration Additional Properties:
/*
* # Shuffle partitions (default 200, adjust based on data size)
* spark.sql.shuffle.partitions=400
*
* # Default parallelism for RDD operations
* spark.default.parallelism=400
*
* # Adaptive query execution (Spark 3.x)
* spark.sql.adaptive.enabled=true
* spark.sql.adaptive.coalescePartitions.enabled=true
* spark.sql.adaptive.skewJoin.enabled=true
*/
// Memory Configuration
/*
* # Executor memory allocation
* spark.executor.memory=16g
* spark.executor.memoryOverhead=4g # 20-25% of executor memory
*
* # Memory fraction for execution/storage
* spark.memory.fraction=0.8
* spark.memory.storageFraction=0.3
*
* # Off-heap memory (for large datasets)
* spark.memory.offHeap.enabled=true
* spark.memory.offHeap.size=8g
*/
// Serialization
/*
* # Use Kryo serialization (faster than Java serialization)
* spark.serializer=org.apache.spark.serializer.KryoSerializer
* spark.kryo.registrationRequired=false
*/
// Caching Strategy
// tJava in Spark job:
/*
* // Cache frequently accessed DataFrame
* val cachedDF = inputDF.cache()
*
* // Or persist with specific storage level
* import org.apache.spark.storage.StorageLevel
* val persistedDF = inputDF.persist(StorageLevel.MEMORY_AND_DISK_SER)
*
* // Unpersist when done
* cachedDF.unpersist()
 */
Data Partitioning Strategies
/*
* Partitioning for Write Operations
*
* tFileOutputParquet (Spark):
* - Partition By: year, month, day (for time-series data)
*
* Generated Spark code:
* outputDF.write
* .mode("overwrite")
* .partitionBy("year", "month", "day")
* .parquet(outputPath)
*/
/*
* Repartitioning for Balanced Processing
*
 * tJava in a Spark flow (illustrated as Spark Scala):
*/
// Column functions used below
import org.apache.spark.sql.functions.{col, concat, lit, rand}

// Repartition by key (for joins)
val repartitionedDF = inputDF.repartition(200, col("customer_id"))
// Coalesce to reduce partitions (no shuffle)
val coalescedDF = processedDF.coalesce(50)
// Repartition evenly (shuffle)
val balancedDF = skewedDF.repartition(200)
/*
* Handling Data Skew
*
* Option 1: Salt the key
*/
val saltedDF = inputDF
.withColumn("salt", (rand() * 10).cast("int"))
.withColumn("salted_key", concat(col("customer_id"), lit("_"), col("salt")))
// Join on salted key, then remove salt
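Completing the pattern requires the other join side to carry every salt value; a rough sketch (lookupDF is an illustrative lookup DataFrame keyed by customer_id):

import org.apache.spark.sql.functions._

// Replicate the lookup side once per salt value and build the matching salted key
val saltedLookupDF = lookupDF
  .withColumn("salt", explode(array((0 until 10).map(lit): _*)))
  .withColumn("salted_key", concat(col("customer_id"), lit("_"), col("salt")))

// Join on the salted key, then drop the helper columns
val joinedDF = saltedDF
  .join(saltedLookupDF.drop("customer_id", "salt"), "salted_key")
  .drop("salted_key", "salt")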
/*
* Option 2: Broadcast smaller table
*/
import org.apache.spark.sql.functions.broadcast
val resultDF = largeDF.join(broadcast(smallDF), "key")
/*
* Option 3: Enable adaptive skew join (Spark 3.x)
* spark.sql.adaptive.skewJoin.enabled=true
 */
Incremental Processing
/*
* Incremental Load Pattern with Spark
*
* Job Structure:
* 1. Read last processed watermark
* 2. Read source data > watermark
* 3. Process and write
* 4. Update watermark
*/
// tJava - Read watermark from control table
/*
* val lastWatermark = spark.sql(
* "SELECT MAX(processed_timestamp) as watermark " +
* "FROM control.job_watermarks WHERE job_name = 'sales_etl'"
* ).collect()(0).getAs[java.sql.Timestamp]("watermark")
*
* // If null, use epoch
* val watermarkValue = Option(lastWatermark).getOrElse(
* java.sql.Timestamp.valueOf("1970-01-01 00:00:00")
* )
*/
// tHiveInput (Spark) - Incremental read
/*
* Query:
* SELECT * FROM source_db.transactions
* WHERE modified_timestamp > '${watermark}'
* AND modified_timestamp <= '${current_timestamp}'
*/
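The same incremental filter can be expressed with the DataFrame API instead of a query string; a minimal sketch (watermarkValue follows the watermark-lookup sketch above):

// Sketch: incremental read using the DataFrame API
import org.apache.spark.sql.functions.{col, current_timestamp, lit}

val incrementalDF = spark.table("source_db.transactions")
  .filter(col("modified_timestamp") > lit(watermarkValue) &&
          col("modified_timestamp") <= current_timestamp())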
// tJava - Update watermark after successful load
/*
* val newWatermark = processedDF.agg(max("modified_timestamp")).collect()(0)(0)
*
* spark.sql(s"""
* INSERT OVERWRITE TABLE control.job_watermarks
* PARTITION (job_name = 'sales_etl')
* SELECT '$newWatermark' as processed_timestamp
* """)
 */
Monitoring and Debugging
Spark UI Integration
/*
* tSparkConfiguration - Enable monitoring
*
* Spark Properties:
* - spark.ui.enabled: true
* - spark.ui.port: 4040
* - spark.eventLog.enabled: true
* - spark.eventLog.dir: hdfs:///spark-events
* - spark.history.fs.logDirectory: hdfs:///spark-events
*/
// Access Spark UI during job execution:
// http://driver-host:4040
// Spark History Server (for completed jobs):
// http://history-server:18080
/*
* Key Metrics to Monitor:
* - Stage completion times
* - Shuffle read/write sizes
* - Task distribution (check for skew)
* - Executor memory usage
* - GC time percentage
 */
Logging and Debugging
// tJava - Add logging to Spark job
import org.apache.log4j.Logger;
// Get logger
Logger logger = Logger.getLogger("TalendSparkJob");
// Log DataFrame statistics (count() triggers a full Spark job - use sparingly in production)
logger.info("Input row count: " + inputDF.count());
logger.info("Input schema: " + inputDF.schema().treeString());
// Print sample data for debugging (show() writes to stdout, not to the logger)
logger.debug("Sample records:");
inputDF.show(5, false);
// Log the execution plan (queryExecution() includes the parsed, optimized, and physical plans)
logger.debug("Execution plan:");
logger.debug(processedDF.queryExecution().toString());
// Print the extended physical plan (explain() also writes to stdout)
logger.debug("Physical plan:");
processedDF.explain(true);
/*
* Log4j Configuration (in log4j.properties):
*
* log4j.rootCategory=WARN, console
* log4j.logger.TalendSparkJob=INFO
* log4j.logger.org.apache.spark=WARN
* log4j.logger.org.apache.hadoop=WARN
 */
Best Practices Summary
talend_big_data_best_practices:
  job_design:
    - "Use Spark components (tXxxSpark) for distributed processing"
    - "Minimize shuffles - partition data strategically"
    - "Use broadcast joins for small lookup tables"
    - "Enable adaptive query execution in Spark 3.x"
  file_formats:
    - "Use Parquet/ORC for analytics (columnar, compressed)"
    - "Use Avro for streaming/messaging (schema evolution)"
    - "Partition data by commonly filtered columns"
    - "Use appropriate compression (Snappy for speed, GZIP for size)"
  performance:
    - "Right-size executors (not too big, not too small)"
    - "Enable dynamic allocation for variable workloads"
    - "Cache DataFrames used multiple times"
    - "Tune shuffle partitions based on data volume"
  reliability:
    - "Implement checkpointing for long-running jobs (see the sketch after this list)"
    - "Use idempotent writes (overwrite partitions)"
    - "Monitor jobs via Spark UI and History Server"
    - "Set appropriate timeouts and retries"
  cloud_specific:
    - "Use cloud-native storage (S3/ADLS/GCS)"
    - "Leverage managed Spark services (EMR/Databricks/Dataproc)"
    - "Use IAM roles instead of access keys"
    - "Consider spot/preemptible instances for cost savings"
Conclusion
Talend Big Data enables enterprise-scale data processing with native Spark and Hadoop integration. Design jobs using Spark components, optimize partitioning strategies, and leverage cloud data lakes for modern architectures. Proper tuning of Spark configurations ensures optimal performance for big data workloads.