Azure Databricks is a fully managed Platform-as-a-Service (PaaS) offering that was released on Feb 27, 2019. It leverages the Microsoft cloud to scale rapidly, host massive amounts of data, and streamline workflows for better collaboration between business executives, data scientists, and engineers.
Azure Databricks is a "first party" Microsoft service, the result of a year-long collaboration between the Microsoft and Databricks teams to provide Databricks' Apache Spark-based analytics service as an integral part of the Microsoft Azure platform.
Azure Databricks uses the Azure Active Directory (AAD) security framework. Existing credentials and their corresponding security settings can be reused for authorization, so access and identity control are handled in the same environment. Using AAD allows easy integration with the rest of the Azure stack, including Data Lake Storage (as a data source or an output), Data Warehouse, Blob Storage, and Azure Event Hubs.
You can use Blob storage to expose data publicly to the world or to store application data privately. For those of you familiar with Azure, Databricks is a premier alternative to Azure HDInsight and Azure Data Lake Analytics.
Connecting Azure Databricks to the Azure Storage Account
- Create a Storage Account, then create a private container in it.
- Upload a blob file into the container. You can download a sample file from this link: https://csg10032000aeaa88a0.blob.core.windows.net/datafile/employe_data.csv
- Open the blob's context menu, click Generate SAS, then copy the Blob SAS token and store it somewhere safe; we will use it later.
- Create an Azure Databricks workspace.
- Click Create, select the subscription (if you have more than one), select or create a resource group, choose the location where the workspace will be created, and finally select the pricing tier.
- Leave the remaining settings unchanged, click Review + Create, and wait for validation.
- Click Create once validation completes.
- Click the Go to resource button once the deployment completes.
- Click Launch Workspace; you will be redirected to the Azure Databricks page.
- Click Clusters in the left pane, click Create Cluster, provide a cluster name, set Cluster Mode to Standard, select the configuration details as shown below, and create the cluster.
- Start your cluster and make sure it is in a running state.
- Click Workspace in the left pane; a second Workspace entry appears. Right-click it and choose Create -> Notebook.
- Give the notebook a name, select Scala as the Default Language, select the cluster you created earlier, and click Create.
- Paste the code below into the notebook to connect to your storage account.
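    // Placeholders: replace with your own container, storage account, and SAS token.
    val containerName = "<Container Name>"
    val storageAccountName = "<StorageAccount Name>"
    val sas = "<Generated SAS Key>"
    // Configuration key under which the SAS token is supplied for this container.
    val config = "fs.azure.sas." + containerName + "." + storageAccountName + ".blob.core.windows.net"
    // Mount the uploaded blob at /mnt/myfile, the path that is read below.
    dbutils.fs.mount(
      source = "wasbs://" + containerName + "@" + storageAccountName + ".blob.core.windows.net/employe_data.csv",
      mountPoint = "/mnt/myfile",
      extraConfigs = Map(config -> sas))
    // Read the CSV with a header row and inferred column types, then display it.
    val mydf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/myfile")
    display(mydf)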
- If you can fetch the data as shown below, you have successfully connected Azure Databricks to your storage account.
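The mount call in the steps above depends on two hand-concatenated strings: the SAS configuration key and the wasbs:// source URI. A typo in either tends to surface only as a failed mount. As a minimal sketch, the string assembly could be pulled into a small helper. `BlobMount`, `sasConfigKey`, and `sourceUri` are hypothetical names introduced here for illustration; they are not part of the Databricks API.

```scala
// Hypothetical helper, not part of the Databricks API: it only assembles the
// two strings that the notebook code passes to dbutils.fs.mount.
object BlobMount {
  // Blob endpoint host name for a storage account.
  private def host(storageAccountName: String): String =
    s"$storageAccountName.blob.core.windows.net"

  // Fail early on empty names instead of producing a malformed key or URI.
  private def check(containerName: String, storageAccountName: String): Unit = {
    require(containerName.nonEmpty, "containerName must not be empty")
    require(storageAccountName.nonEmpty, "storageAccountName must not be empty")
  }

  // Configuration key under which the container's SAS token is supplied.
  def sasConfigKey(containerName: String, storageAccountName: String): String = {
    check(containerName, storageAccountName)
    s"fs.azure.sas.$containerName.${host(storageAccountName)}"
  }

  // wasbs:// URI of the container, optionally followed by a blob path.
  def sourceUri(containerName: String, storageAccountName: String, blobPath: String = ""): String = {
    check(containerName, storageAccountName)
    val base = s"wasbs://$containerName@${host(storageAccountName)}"
    val cleanPath = blobPath.stripPrefix("/")
    if (cleanPath.isEmpty) base else s"$base/$cleanPath"
  }
}
```

In the notebook, these values would stand in for the concatenated strings: pass BlobMount.sourceUri(containerName, storageAccountName, "employe_data.csv") as source and Map(BlobMount.sasConfigKey(containerName, storageAccountName) -> sas) as extraConfigs. The mount itself still has to run on a Databricks cluster, since dbutils is only available there.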
So far, we have covered creating an Azure Databricks workspace, creating a cluster and a notebook, and connecting a storage account to Databricks to access data using Scala. Engineers who collaborate with business stakeholders to identify and meet data requirements, and who design and implement the management, monitoring, security, and privacy of data using the full stack of Azure services, will benefit extensively from understanding Databricks.
Join our online forum discussions and study groups to enrich your knowledge in pursuit of becoming a Data Science expert with DP-200 Exam: Implementing an Azure Data Solution. Here is a comprehensive study guide to help you crack the exam along with sample questions.