navinkumarnotes123.hashnode.dev · 2 hours ago
Hive Partition with Bucket
In Hive, partitioning and bucketing are two techniques used for organizing and optimizing data storage and querying. Example scenario: consider a dataset containing drug sales details, and it contains 6 pid, pname, drug, ge...
Tags: hive

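As context for the teaser, here is a minimal sketch of a Hive table that is both partitioned and bucketed, submitted through PySpark's spark.sql. The column names follow the truncated preview (pid, pname, drug, gender, ...), and the partition key, bucket count, and storage format are illustrative assumptions rather than values from the post.

```python
# Sketch only: HiveQL run through a SparkSession with Hive support.
# Columns, the gender partition key, and the bucket count of 4 are assumptions
# based on the truncated teaser, not the post itself.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS drug_sales (
        pid    INT,
        pname  STRING,
        drug   STRING,
        amount DOUBLE
    )
    PARTITIONED BY (gender STRING)      -- one directory per partition value
    CLUSTERED BY (pid) INTO 4 BUCKETS   -- rows hashed on pid into 4 files
    STORED AS ORC
""")
```

With this layout, each gender partition directory holds 4 bucket files, so queries that filter on gender prune directories and operations keyed on pid (joins, sampling) can work bucket by bucket.
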
navinkumarnotes123.hashnode.dev · 20 hours ago
How to decide bucket count in Hive
Steps: Calculate the expected bucket size by dividing the table size by the block size on Hadoop to get an initial estimate: Expected Bucket Size = Table Size / Block Size on Hadoop. Find the nearest power of 2: take the base-2 logarithm of the ini...
Tags: hive

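The teaser's rule of thumb translates directly into arithmetic. A small sketch follows; the table and block sizes are made-up numbers, and snapping to the nearest power of 2 by rounding the base-2 logarithm is an assumption, since the preview is cut off before it says whether to round up or down.

```python
import math

# Rule of thumb from the teaser: initial estimate = table size / HDFS block size,
# then snap to a power of 2. Rounding log2 to the nearest integer is an assumption.
table_size_bytes = 10 * 1024**3   # e.g. a 10 GB table (illustrative)
block_size_bytes = 128 * 1024**2  # common HDFS block size of 128 MB

estimate = table_size_bytes / block_size_bytes       # 80.0
bucket_count = 2 ** round(math.log2(estimate))       # 2**6 = 64
print(bucket_count)
```
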
Kiran Reddy for Databricks - PySpark · databricks-pyspark-blogs.hashnode.dev · Mar 18, 2024
Unlocking Data Potential: Introducing Databricks Unity Catalog
In today's data-driven world, managing vast amounts of information efficiently is crucial for businesses to thrive. Databricks, a leading provider of unified analytics platforms, continues to innovate in this space with its groundbreaking tool: Datab...
Tags: Databricks, Unity Catalog

ashik imam · lyzer.hashnode.dev · Mar 16, 2024
Exploring Deep Learning Algorithms in Data Science
Introduction: In the realm of Data Science, deep learning algorithms have emerged as powerful tools for extracting meaningful insights from complex datasets. Deep learning, a subset of machine learning, involves training artificial neural networks to ...
Tags: Data Science

Kiran Reddy for Databricks - PySpark · databricks-pyspark-blogs.hashnode.dev · Mar 13, 2024
Mount Points in Databricks
What is DBFS? DBFS stands for Databricks File System. It's a distributed file system that's part of the Databricks Unified Data Analytics Platform. DBFS provides a scalable and reliable way to store data across various Databricks clusters. It's desig...
Tags: mountpoint

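For context, a sketch of how external storage is typically mounted into DBFS from a Databricks notebook. The storage account, container, and secret scope/key names are hypothetical placeholders, and dbutils is only predefined inside a Databricks runtime.

```python
# Sketch only: runs inside a Databricks notebook, where `dbutils` is predefined.
# Storage account, container, and secret scope/key names are hypothetical.
storage_account = "mystorageaccount"
container = "raw"

dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
    mount_point=f"/mnt/{container}",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

display(dbutils.fs.ls(f"/mnt/{container}"))  # browse the mounted files
```

Once mounted, the location behaves like any other DBFS path; it can later be detached with dbutils.fs.unmount("/mnt/raw").
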
Kinyanjui Karanja · overflow.hashnode.dev · Mar 12, 2024
Loading, Transforming, and Saving GitHub Archive Data with PySpark
Introduction: GitHub Archive provides a wealth of data capturing various activities on the GitHub platform, such as repository creation, issues opened, and pull requests made. In this blog post, we'll explore how to use PySpark, a powerful analytics ...
Tags: PySpark

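A minimal load-transform-save sketch in the spirit of this post. It assumes an hourly GitHub Archive dump has already been downloaded locally, and the selected fields and output path are illustrative rather than taken from the article.

```python
# Sketch only: assumes an hourly GitHub Archive dump (e.g. 2024-03-01-15.json.gz
# from https://www.gharchive.org/) has already been downloaded to ./data/.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gharchive").getOrCreate()

# Load: each line is one JSON event; Spark reads .json.gz transparently.
events = spark.read.json("data/2024-03-01-15.json.gz")

# Transform: keep a few fields and count events per type and repository.
counts = (
    events
    .select("type", F.col("repo.name").alias("repo"))
    .groupBy("type", "repo")
    .count()
)

# Save: write the aggregated result as Parquet.
counts.write.mode("overwrite").parquet("output/event_counts")
```
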
Cloud Tuned · cloudtuned.hashnode.dev · Mar 11, 2024
Exploring 5 Apache Hadoop Use Cases
Apache Hadoop, an open-source software framework, has revolutionized the way big data is managed and analyzed. Originally developed by Doug Cutting and Mike Cafarella in 2005, Hadoop has become synonymous with dist...
Tags: hadoop

Isaac Oteng · isaacoteng.hashnode.dev · Mar 7, 2024
Data Ingestion Using AWS Services, Part 2
Querying AWS S3 data from AWS Athena using SQL. AWS Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. In this second part of the tutorial, we are going to...
Tags: AWS

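The post runs plain SQL against S3 data from the Athena console; the same idea can be sketched programmatically with boto3. The database, table, and result-bucket names below are placeholders, not values from the tutorial.

```python
# Sketch only: submits a SQL query to Athena via boto3 and waits for it.
# Database, table, and S3 output location are hypothetical placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

run = athena.start_query_execution(
    QueryString="SELECT * FROM sales_data LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
query_id = run["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```
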
Isaac Oteng · isaacoteng.hashnode.dev · Mar 7, 2024
Data Ingestion Using AWS Services, Part 1
Data ingestion is the process of collecting, importing, and transferring raw data from various sources to a storage or processing system where it can be further analyzed, transformed, and used for various pur...
Tags: AWS

Kiran Reddy for Databricks - PySpark · databricks-pyspark-blogs.hashnode.dev · Feb 27, 2024
Reading different files in PySpark
Introduction: In this blog, we'll explore the versatile capabilities of Apache Spark with PySpark for reading, writing, and processing data in Databricks environments. From handling various file formats to seamlessly integrating with external data sou...
Tags: Databricks

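A short sketch of the DataFrameReader entry points the post covers. The file paths (and DBFS-style /mnt prefixes) are placeholders, and the options shown are common choices rather than necessarily the post's exact ones.

```python
# Sketch only: standard PySpark readers for a few common file formats.
# Paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CSV: treat the first row as a header and let Spark infer column types.
csv_df = spark.read.option("header", True).option("inferSchema", True).csv("/mnt/raw/sales.csv")

# JSON: use multiLine for files holding one large JSON array or object.
json_df = spark.read.option("multiLine", True).json("/mnt/raw/customers.json")

# Parquet: the schema travels with the file, so no extra options are needed.
parquet_df = spark.read.parquet("/mnt/raw/orders.parquet")

csv_df.printSchema()
```
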