Apache Hive and Apache Tez – Memory management and Tuning

Posted by Pravat Kumar Sutar

Jan 15, 2018 11:18:13 PM

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce’s ability to scale to petabytes of data.

YARN considers all the available computing resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications running in the cluster, such as MapReduce. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (for example, memory, CPU, and so on).

In a Hadoop cluster, it is important to balance the memory (RAM) usage, processors (CPU cores), and disks so that processing is not constrained by any one of these cluster resources. Generally, allow for 2 containers per disk and per core for the best balance of cluster utilization.
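The balancing rule above can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names, the `usable_ram_gb` and `min_container_gb` parameters, and the idea of capping the count by available memory are hypothetical and not taken from the article; only the "2 containers per disk and per core" rule comes from the text.

```python
def max_containers(cores: int, disks: int, usable_ram_gb: float,
                   min_container_gb: float) -> int:
    """Estimate containers per node so that no single resource is the bottleneck.

    Assumption: usable_ram_gb is the RAM left for YARN after the operating
    system and other services have taken their share.
    """
    by_cpu = 2 * cores                                 # rule from the text: 2 per core
    by_disk = 2 * disks                                # rule from the text: 2 per disk
    by_ram = int(usable_ram_gb // min_container_gb)    # assumed memory cap
    return max(1, min(by_cpu, by_disk, by_ram))


def ram_per_container_gb(usable_ram_gb: float, containers: int,
                         min_container_gb: float) -> float:
    """Split the usable RAM evenly, never going below the minimum container size."""
    return max(min_container_gb, usable_ram_gb / containers)


if __name__ == "__main__":
    # Hypothetical node: 12 cores, 8 disks, 96 GB usable RAM, 2 GB minimum container.
    n = max_containers(cores=12, disks=8, usable_ram_gb=96, min_container_gb=2)
    print(n, ram_per_container_gb(96, n, 2))
```

In this hypothetical node the disks are the limiting resource, so adding cores or memory alone would not raise the container count.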

 

This article outlines best practices for managing the memory of the application master and containers, the Java heap size, and the memory allocated to the distributed cache.

Read More

Topics: Hadoop, Hive, Tez

Tuning Resource Allocation in Apache Spark

Posted by Pravat Kumar Sutar

Jan 15, 2018 11:16:04 PM

 

Resource allocation is an important aspect of executing any Spark job. If it is not configured correctly, a Spark job can consume the entire cluster's resources and starve other applications.

Here I provide some insights into configuring resource allocation when running Spark. The focus is on how to configure the number of executors, the memory for each executor, and the number of cores for a Spark job.
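As a rough illustration of the three settings named above, the sketch below derives them from a cluster description. It is not the article's method: the defaults (5 cores per executor, 1 core and 1 GB reserved per node, one executor slot left for the application master, a 10% memory overhead) are commonly quoted rules of thumb assumed here, and `plan_executors` and `spark_submit_args` are hypothetical helper names. The `--num-executors`, `--executor-cores`, and `--executor-memory` flags are standard `spark-submit` options on YARN.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExecutorPlan:
    num_executors: int
    executor_cores: int
    executor_memory_gb: int


def plan_executors(nodes: int, cores_per_node: int, ram_gb_per_node: int,
                   cores_per_executor: int = 5, reserved_cores: int = 1,
                   reserved_ram_gb: int = 1,
                   overhead_fraction: float = 0.10) -> ExecutorPlan:
    """Derive executor settings from node sizes (all defaults are assumptions)."""
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")
    # Leave one executor slot in the cluster for the application master.
    total_executors = max(1, nodes * executors_per_node - 1)
    # Memory available to each executor container on a node.
    container_gb = (ram_gb_per_node - reserved_ram_gb) / executors_per_node
    # The container holds heap plus off-heap overhead, so shrink the heap to fit.
    heap_gb = int(container_gb / (1 + overhead_fraction))
    return ExecutorPlan(total_executors, cores_per_executor, heap_gb)


def spark_submit_args(plan: ExecutorPlan) -> List[str]:
    """Render the plan as spark-submit command-line arguments."""
    return ["--num-executors", str(plan.num_executors),
            "--executor-cores", str(plan.executor_cores),
            "--executor-memory", f"{plan.executor_memory_gb}g"]


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes, 16 cores and 64 GB RAM each.
    print(" ".join(spark_submit_args(plan_executors(10, 16, 64))))
```

Capping each job this way is one means of keeping a single Spark job from taking the whole cluster; the right values depend on the workload and on what else shares the cluster.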

Read More

Topics: Hadoop, Spark

Oracle Database Connectivity in Power BI Report Server

Posted by Adhil Mowlana

Jan 8, 2018 11:03:25 AM

 

The Power BI Service is Microsoft's cloud-based analytical suite that hosts Power BI reports and dashboards. The cloud is convenient and scalable, and it is very cost-effective for small to medium-sized organizations. Nevertheless, certain organizations still prefer that their data and reporting remain on their own premises due to various concerns. Power BI Report Server is on-premises software that enables hosting Power BI reports and traditional SQL Server Reporting Services (SSRS) reports in the same environment. (Power BI Report Server comes free with SQL Server Enterprise Edition or Power BI Premium.)

Read More

Topics: Oracle connectivity in Power BI, Connect Oracle to Power BI, Oracle connectivity in Power BI Report Server, SQL Server Analysis services from Oracle, Power BI Report server, Power BI reports from Oracle, SSAS from Oracle, On premise Power BI reports

Using ADF v2 and SSIS to load data from XML Source to SQL Azure

Posted by Adhil Mowlana

Nov 6, 2017 8:21:56 AM

 

Since the release of Azure Data Factory V2, I have played around with it a bit, but I have been looking for a real-world use case where V2 would be better suited than V1. Working with a local bank on a Proof of Concept has provided this opportunity. As part of the PoC, we are loading XML files into SQL Data Warehouse.

XML files are widely used, and they often contain multiple entities of related data. Extracting data from XML files using Azure Data Factory V1 still takes a lot of work with custom code. SQL Server Integration Services (SSIS) packages, on the other hand, make this process much easier and are a tried and tested solution. It is quite easy to extract data from XML files and load it into multiple staging tables while preserving the relationships. Because Azure Data Factory V2 can run SSIS packages, and XML data sources are common in cloud data sets, V2 works very well for this use case.

Read More

Topics: SSIS in Azure, SSIS, Azure Data factory, SQL Server 2017

Migrating the Data warehouse to the cloud does not need to take years

Posted by Dang Trung Tin

Oct 30, 2017 6:47:08 PM

 

A few weeks ago, we had a chance meeting with the CIO of a large company in Asia. He shared that his team was having performance issues with its existing data warehouse system: every day, it took a very long time for the data to be ready for end users. He asked for our help.

Read More

Topics: PowerBI, Azure Analysis Services, Data Warehouse, Performance Improvement, Azure SQL Data Warehouse, Analysis Services
