Big Data Management on Apache Spark

Northeastern University

Sep 2024 - Oct 2024

Overview

Designed distributed computing workflows for streaming and batch analytics.

Student

Built and ran a distributed data workflow with Apache Spark, Hive, and Impala, processing 10M+ records to analyze data warehouse efficiency.

Implemented data partitioning and compressed Parquet columnar storage, improving query performance by 25% and reducing I/O operations by 30%.