Wrangling in the Azure Cloud

March 5, 2018

The world is more cloud-centric than ever

Cloud infrastructure has exploded in popularity over the past decade. In 2009, annual spending on cloud infrastructure was virtually zero; by the close of 2019, it had reached almost $100 billion.

Though the cloud market is vast and varied, there are a few key players that have led this surge in cloud infrastructure: Amazon Web Services, Google Cloud and Microsoft Azure. Deciding between the “big three,” as they’ve been dubbed, isn’t cut and dried, largely because each organization must consider its own unique workloads and variables. What’s more, many of today’s organizations aren’t just picking one cloud provider, but rather are taking a multi-cloud approach that encompasses several vendors.

For the sake of this post, however, let’s zero in on one of the big three vendors, Microsoft Azure, to understand its strengths and practical use. 

What is Azure?

Azure is Microsoft’s cloud computing offering. Cloud computing means that organizations or individuals pay for access to the computing resources of a specific provider (here, Microsoft) instead of footing the bill for their own data servers. Microsoft Azure offers a range of ways to take advantage of its cloud offering, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

The difference between Azure IaaS, PaaS, and SaaS comes down to how much or how little you want to manage yourself. It’s worth noting that all of these options require less management than an on-premises data infrastructure, which is ultimately what makes cloud infrastructure a less expensive option overall.

  • Azure IaaS: Relieves organizations from buying and hosting their own physical data servers by leveraging Microsoft’s data center. Includes networking firewalls/security. Allows organizations to easily scale their usage up and down based upon demand, which makes it an efficient solution.
  • Azure PaaS: Includes all of the above and can also include middleware, development tools, business intelligence (BI) services, database management systems, and more.
  • Azure SaaS: Provides end-to-end management, allowing users and organizations to get up and running with their data projects right away. Operates on a pay-as-you-go model so that users only pay for the exact services they need. 
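
The "how much you manage yourself" distinction in the list above can be sketched in a few lines of Python. This is a simplified illustration only: the layer names and the cut points between models follow the commonly used textbook split, not an official Azure definition, and `LAYERS`, `PROVIDER_MANAGED` and `customer_managed` are hypothetical names introduced here.

```python
from typing import List

# Layers of a typical stack, from the hardware up (a common simplification).
LAYERS = [
    "networking", "storage", "servers", "virtualization",
    "operating system", "middleware", "runtime",
    "data", "applications",
]

# How many layers, counted from the bottom, the provider manages under
# each model. Assumed textbook split, not an official Azure definition.
PROVIDER_MANAGED = {"on-premises": 0, "IaaS": 4, "PaaS": 7, "SaaS": 9}


def customer_managed(model: str) -> List[str]:
    """Return the layers the customer still manages under `model`."""
    return LAYERS[PROVIDER_MANAGED[model]:]


if __name__ == "__main__":
    for model in PROVIDER_MANAGED:
        remaining = customer_managed(model)
        print(f"{model}: you manage {', '.join(remaining) or 'nothing'}")
```

Reading the output from top to bottom shows the customer's share shrinking as you move from on-premises through IaaS and PaaS to SaaS.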

What are the benefits of Azure?

There are a few benefits of Microsoft Azure that are intrinsic to all cloud providers. Those include: 

  • Reduced IT costs
    As mentioned above, cloud computing negates the need for your own data servers, which means IT doesn’t need to spend time setting up and maintaining these servers. The IT costs returned to the organization can be quite significant, depending upon how many servers the organization would require.
  • Scalability
    Cloud computing also allows organizations to scale usage up and down as their needs change. That means no overpaying for computing power that goes unused—organizations can pay for exactly what they need, and no more.
  • Competitive Edge
    Many organizations will point to the competitive edge that a cloud platform offers as one of its major benefits. It isn’t necessarily the cloud platform itself that offers a competitive edge, but what it enables: increased collaboration, faster response time, and improved access to cloud-based data sources. 

In addition to the above benefits, Microsoft Azure offers its own set of Azure products and services to help organizations drive machine learning initiatives, build IoT solutions, unify platforms, and much more. Many of the largest organizations have selected Azure as their cloud platform—Microsoft Azure states that “95% of Fortune 500 companies trust their business on Azure.” 

Wrangling on Azure Cloud

The move to cloud platforms like Azure has brought an increasing number of analytics and machine learning (ML) initiatives to the cloud. But that doesn’t necessarily mean that all of the data required for these initiatives is hosted in the cloud. Often, the data is spread across locations both inside and outside the cloud.

The good news? This hybrid requirement fits well with the Microsoft Azure ecosystem, since Azure can extend its services and capabilities to an organization’s environment of choice. But there’s still the question of which services and capabilities, exactly, organizations should implement to give their business users uniform access to the data they need, no matter where it lives. Business users need to be able to explore and prepare data so that it fits the specific needs of their initiatives, especially when that data is generated from many different sources.

Organizations would be remiss in choosing a more traditional method of preparing this data, such as Extract, Transform, and Load (ETL) tools. ETL tools were built for IT users, not business users, which often leaves business users waiting in line to get data cleaned, passing specs back and forth until they’ve received their desired output. In other words, ETL doesn’t provide the efficiency one pictures when adopting a cloud platform. 

Instead, in conjunction with cloud platforms like Microsoft Azure, many organizations are looking toward modern data preparation platforms, such as Trifacta. A data preparation platform like Trifacta is the necessary medium between complex data platforms and business users with little to no technical skills. Its visual and machine learning-driven user experience allows business users to deeply understand their data right away and make the right transformations so that the data is ready for use in analytics initiatives. Plus, common metadata means that Trifacta is seamless across data platforms. 

Trifacta on Microsoft Azure Marketplace

Trifacta is available for direct purchase on the Microsoft Azure Marketplace under the intelligence, analytics and compute categories. Customers can take immediate advantage of Trifacta’s platform by choosing the “Get It Now” option, which is transacted and contracted directly through the Azure Marketplace, for rapid integration and deployment on Azure Data Lake Storage Gen2, Azure Databricks, Azure SQL Data Warehouse and Cloudera.

Fig 1 – Typical deployment architecture of Trifacta on Microsoft Azure

Trifacta integrates natively with several components and services that are part of the Azure Cloud Platform. Most importantly, it takes key security requirements into account to ensure that data access and processing meet strict enterprise governance standards and protocols.

Storage (Azure Data Lake & Windows Azure Storage Blob)

You can use Trifacta to wrangle data stored in either Azure Data Lake Storage (ADLS) or Windows Azure Storage Blob (WASB). These Azure storage services support a wide variety of use cases. Combined with the security framework described below, they keep data access secure.

Analytics Store (SQL Data Warehouse)

Once data is wrangled in Trifacta, it can be made available to a variety of downstream analytics platforms and applications. Azure SQL Data Warehouse is the most popular platform for interoperating with analytics applications such as Power BI, Tableau and Qlik. Trifacta supports reading from and writing to SQL Data Warehouse via either JDBC (for small to medium-sized data) or PolyBase (for larger data volumes).
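
As a rough illustration of the JDBC versus PolyBase choice, the sketch below picks an interface from a dataset's size. The post does not say where "small/medium" ends and "larger" begins, so the 1 GiB cutoff is an arbitrary assumption, and `choose_sqldw_interface` is a hypothetical helper, not part of Trifacta or any Azure SDK.

```python
JDBC = "jdbc"
POLYBASE = "polybase"

# Assumed cutoff for illustration only; the post gives no number.
POLYBASE_THRESHOLD_BYTES = 1 * 1024**3  # 1 GiB


def choose_sqldw_interface(dataset_bytes: int,
                           threshold: int = POLYBASE_THRESHOLD_BYTES) -> str:
    """Pick JDBC for smaller datasets and PolyBase for larger ones."""
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    return POLYBASE if dataset_bytes >= threshold else JDBC


if __name__ == "__main__":
    print(choose_sqldw_interface(10 * 1024**2))   # a 10 MiB extract
    print(choose_sqldw_interface(50 * 1024**3))   # a 50 GiB table
```

In practice the right cutoff depends on the workload, so the threshold is exposed as a parameter rather than fixed.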

Data Processing (Photon or Spark via HDInsight)

Whether your data volume is measured in GB, TB or PB, Trifacta can wrangle it on Azure by using the compute engine best suited to the workload. For small to medium data volumes, Trifacta’s Photon in-memory compute framework is available within the application running on Azure. For larger volumes, Trifacta integrates natively with Apache Spark running on HDInsight v3.6.
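
The idea of matching the engine to the data volume can be sketched as follows. This is not how Trifacta selects an engine internally; the 10 GB cutoff, the decimal unit table and the names `to_bytes` and `choose_engine` are all assumptions made for the example.

```python
# Decimal units, as in "GB, TB or PB".
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15}

# Assumed cutoff between "small to medium" and "larger" volumes.
SPARK_THRESHOLD_BYTES = 10 * UNITS["GB"]


def to_bytes(size: float, unit: str) -> int:
    """Convert a size such as (5, "TB") to a byte count."""
    return int(size * UNITS[unit.upper()])


def choose_engine(size: float, unit: str) -> str:
    """Return "photon" for smaller volumes and "spark" for larger ones."""
    if to_bytes(size, unit) >= SPARK_THRESHOLD_BYTES:
        return "spark"
    return "photon"


if __name__ == "__main__":
    for size, unit in [(2, "GB"), (5, "TB"), (1, "PB")]:
        print(f"{size} {unit} -> {choose_engine(size, unit)}")
```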

Security (SSO, Domain Joined Cluster)

Trifacta Wrangler Enterprise supports secure access to all the resources provided on Azure via various single sign-on (SSO) technologies. By default, you can authenticate through Azure Active Directory (Azure AD), a fully cloud-enabled directory service offered by Microsoft. You can also integrate your existing LDAP directory services with Azure AD to secure access to Trifacta.

For full enterprise security support, you can also configure your HDInsight cluster as a domain-joined cluster, making it part of your Active Directory domain. Trifacta supports accessing and running wrangling jobs against a domain-joined cluster. For secured Hive access, Trifacta also supports Apache Ranger in conjunction with HDInsight.

Trifacta is Azure Co-sell Ready

This comprehensive support for Azure data services, together with growing customer adoption on Azure, led Microsoft to certify Trifacta as a Microsoft co-sell partner. Co-sell status reflects not only a base of large, strategic joint customers but also the deep technical due diligence Microsoft conducted in reviewing Trifacta’s solution on Azure. For customers, this means the joint solution has been tested at some of the world’s largest enterprises and reviewed in depth by Microsoft Azure experts. This level of vetting should give you confidence in rolling out Trifacta on Azure across your organization.
