Azure Microsoft Cloud Bill Payment: Getting Started with Azure Synapse Analytics, the Modern Data Warehouse

cloud 2026-06-01

In today's era of big data, many companies find themselves painfully "backed up" when it comes to data analysis and reporting:

A company has accumulated several terabytes or even petabytes of data over the years, scattered across different places (business databases, log files, various third-party SaaS platforms). A product manager or head of operations asks for a cross-quarter, multi-dimensional user-profile report, and the query is run against a traditional SQL database. Half a day later it is still spinning. By the afternoon, not only has the report failed to appear, but the enormously expensive query has also saturated the CPU of the production database, freezing the front-end app and triggering a flood of customer complaints.

This traditional "siloed" or "small workshop" data architecture buckles under massive data volumes. The business suffers, developers are exhausted, and operations staff live in fear.

To tackle both pain points, slow queries over massive data and data scattered everywhere, Microsoft Azure offers its flagship service in the data analytics space: Azure Synapse Analytics (a modern data warehouse and analytics service).

Its core logic is blunt but elegant: it brings the traditional enterprise data warehouse and modern big data analytics together in a single, fully managed workspace.

It relies on an underlying massively parallel processing (MPP) architecture that splits a complex query, one that might otherwise run for hours, into dozens or even hundreds of smaller tasks and hands them to a back-end compute cluster to execute simultaneously. All you need to do is write a standard SQL statement and run it. Even against massive data, a well-designed query can often return in seconds.
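The split-and-merge idea behind MPP can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function names, the four worker threads, and the sample rows are hypothetical, and Synapse's real scheduler is far more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one hypothetical compute node: aggregate its own slice."""
    totals = {}
    for user_id, amount in chunk:
        totals[user_id] = totals.get(user_id, 0) + amount
    return totals


def mpp_sum(rows, workers=4):
    """Scatter rows to workers, aggregate in parallel, then merge the partials."""
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    merged = {}
    for part in partials:
        for user_id, amount in part.items():
            merged[user_id] = merged.get(user_id, 0) + amount
    return merged


# Hypothetical (user_id, amount) rows.
rows = [("u1", 10), ("u2", 5), ("u1", 7), ("u3", 2), ("u2", 1)]
result = mpp_sum(rows)
print(result)
```

Each worker only ever sees its own slice, and the final merge touches just the small partial results. That is why adding more workers can shorten a large aggregation.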


Today we skip the official marketing and the dry theoretical parameters and go straight to real production practice at large tech companies. We will demystify Azure Synapse Analytics and build your own fast big data analytics platform in the cloud in about 10 minutes.

Stage 1: A deep teardown of Azure Synapse's "multiverse" model

Before you open the console, build a mental model of how Azure Synapse is organized. Many people get lost in the console because they don't realize it actually contains three quite different "parallel universes":

Universe 1: Serverless SQL Pool (the exploration pioneer): This is the most economical and most surprising option. There are no servers for you to provision, and you pay only for the amount of data your queries process (about $5 per TB). Its job is to handle the pile of messy CSV, JSON, or Parquet files sitting in cloud storage: you don't need to create or load any tables, and a standard SQL statement can query the files directly as if they were a database. It is well suited to ad hoc data exploration.

Universe 2: Dedicated SQL Pool (the heavy cavalry): This is the traditional large-scale enterprise data warehouse (formerly Azure SQL DW). It is a provisioned cluster billed by the hour. It uses a standard MPP distributed architecture in which incoming data is spread across 60 underlying distributions. When you need to run the company's core daily reports over hundreds of millions of rows, this cluster runs at full speed and delivers stable, predictable response times.
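To make "spread across 60 distributions" concrete, here is a minimal Python sketch of hash distribution. It is an assumption-laden illustration: SHA-256 stands in for whatever hash function Synapse uses internally, and the `user-N` keys are made up.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # the figure quoted in the article


def distribution_for(key: str) -> int:
    """Map a distribution-column value to one of the 60 buckets (illustrative hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS


# Hypothetical keys: 60,000 user IDs.
counts = Counter(distribution_for(f"user-{i}") for i in range(60_000))
print(min(counts.values()), max(counts.values()))
```

The same key always lands in the same bucket, so rows that share a key stay together. A column with many distinct values spreads rows fairly evenly, while a column with only a few distinct values would leave most of the 60 buckets idle.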

Universe 3: Integrated data integration (Synapse Pipelines, the movers): Think of it as Azure Data Factory (ADF) built into Synapse. With little or no code, it can automatically pump data from your on-premises data center or various external databases into the warehouse.

The clever part is that all three universes are connected in a single interface, with shared data and isolated compute. This is about as good as a modern data platform gets.

Stage 2: Hands-on exercise 1: building a modern data warehouse from scratch in 10 minutes

Make sure you already have an Azure account and a basic Azure Data Lake Storage Gen2 account for storing the raw files.

Step 1: Create the Synapse workspace

Sign in to the Azure portal.

Enter "Azure Synapse Analytics" in the search bar at the top and open the service page.

Click "Create" at the top and fill in the following:
Basic information: select your resource group, name the workspace synapse-workspace-prod, and select the region nearest to you (e.g. East Asia, Hong Kong).
Data Lake Storage Gen2: select the storage account you created in advance and specify a container named raw-data. Note: this container serves as the "rear base" for the entire data warehouse, where all raw files will land.

Enter the administrator user name and password, then click Next until creation is complete.

Step 2: Open the workbench (Synapse Studio)

Once creation is complete (usually about 2 minutes), open the resource page.

In the center of the page there is a large, bright blue button: "Open Synapse Studio".

Click it. The page jumps to a separate, rather sci-fi-looking data workbench. Data scientists, BI engineers, and administrators at large companies work side by side in this interface every day.

Stage 3: Hands-on exercise 2: using Serverless SQL to query a large volume of raw files in seconds

Let's simulate a realistic development scenario: the company's overseas e-commerce system has just dropped last month's global order transaction logs, tens of millions of records totaling several GB compressed (in Parquet or CSV format), into our raw-data data lake container.

Now the product manager urgently wants to know: "Who were the top 10 biggest spenders in the world last month?"

The old way, you would have to create tables, write code, and run ETL to import those tens of millions of rows into a database, which is a lot of trouble. With Synapse, we can use Serverless SQL for a quick strike instead.

On the left side of the Synapse Studio screen, click the Data icon.

Switch to the "Linked" tab, expand your Data Lake storage account, and find the folder where the order file is stored.

Now for the impressive part: right-click the large order file and select "New SQL script" -> "Select TOP 100 rows".

The system automatically generates a SQL statement for you. Modify it slightly so that it expresses the logic the product manager wants.
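A minimal sketch of what such a query might look like, built here as a Python string so that the file path stays a parameter. The column names `user_id` and `amount`, the `orders` folder, the placeholder storage account, and the helper name are all assumptions for illustration, not the article's original script. `OPENROWSET` with `BULK` and `FORMAT = 'PARQUET'` is the serverless SQL pool construct for reading files in place.

```python
def top_spenders_query(storage_url: str, limit: int = 10) -> str:
    """Build a serverless SQL query over Parquet files (column names are assumed)."""
    return f"""
SELECT TOP {int(limit)}
    user_id,
    SUM(amount) AS total_spent
FROM OPENROWSET(
    BULK '{storage_url}',
    FORMAT = 'PARQUET'
) AS orders
GROUP BY user_id
ORDER BY total_spent DESC;
""".strip()


# Placeholder account and folder names; replace them with your own.
query = top_spenders_query(
    "https://<storage-account>.dfs.core.windows.net/raw-data/orders/*.parquet"
)
print(query)
```

Pointing the path at only last month's folder, rather than at the whole container, keeps the amount of data scanned (and billed) as small as possible.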

Click Run at the top.

The back-end serverless compute spins up on demand. It needs no indexes and reads the scattered files directly in the data lake. After only a few seconds, the IDs and total spending of the top 10 spenders appear in the Results window below.

Call the product manager over and turn the screen toward them. The whole process is effortless. This is the speed of cloud-native modernization.

Stage 4: Pitfalls and hard lessons under high-concurrency, enterprise-scale workloads

This fully managed big data platform is quick to pick up, and it hides most of the complexity of the underlying distributed system. But to survive real commercial workloads with heavy traffic and highly concurrent reporting, you, as the chief data architect, should plug the following two hidden holes before you close your laptop:

1. The financial disaster caused by unbounded Serverless SQL scans.

As mentioned earlier, Serverless SQL is extremely convenient: there are no machines to provision, and you are billed by the amount of data your queries process (about $5 per TB scanned).

The disaster scenario: a junior developer or operations engineer writes a sloppy query (for example, one with no time-range filter that uses SELECT * to scan everything), then puts it in a script that fires automatically every 5 minutes. Because each run scans hundreds of GB of raw logs, within a few days the serverless SQL bill can easily reach thousands of dollars, and the finance department will come looking for you.
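The arithmetic behind that warning can be checked with a short Python sketch. The 300 GB per run is a hypothetical stand-in for "hundreds of GB", the $5 per TB rate is the approximate figure quoted above, and treating 1 TB as 1024 GB is an assumption.

```python
PRICE_PER_TB_USD = 5.0  # approximate rate quoted in the article
GB_PER_TB = 1024        # assumption: binary terabytes


def scan_cost_usd(gb_per_run: float, interval_minutes: int, days: int) -> float:
    """Estimate the bill for a query that rescans the same data on a timer."""
    runs_per_day = (24 * 60) // interval_minutes
    total_tb = gb_per_run * runs_per_day * days / GB_PER_TB
    return total_tb * PRICE_PER_TB_USD


# Hypothetical runaway script: 300 GB scanned every 5 minutes for 5 days.
cost = scan_cost_usd(gb_per_run=300, interval_minutes=5, days=5)
print(f"Estimated bill: ${cost:,.2f}")
```

Under these assumptions the script runs 288 times a day and scans roughly 84 TB daily, so the bill passes two thousand dollars in five days.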

The architect's standard safeguard is a hard spending cap: in Synapse Studio, click "Manage" -> "SQL pools". Open the cost control settings of the built-in serverless SQL pool and configure the daily, weekly, and monthly limits on data processed. For example, cap it at 2 TB per day. Once junk code or a runaway loop exceeds the limit, the system rejects further queries with an error, protecting the company's budget.
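The same limits can apparently also be set with T-SQL instead of the Studio dialog. To my understanding, Microsoft's cost management documentation for the serverless SQL pool describes a `sp_set_data_processed_limit` procedure with `@type` and `@limit_tb` parameters; treat that as an assumption and check the current documentation before relying on it. The Python helper below is hypothetical and only assembles the statements. The 2 TB daily value comes from the example above, while the weekly and monthly values are made up.

```python
def limit_statements(daily_tb: int, weekly_tb: int, monthly_tb: int) -> list:
    """Build one T-SQL statement per period (procedure name taken from the docs as an assumption)."""
    limits = {"daily": daily_tb, "weekly": weekly_tb, "monthly": monthly_tb}
    return [
        f"EXEC sp_set_data_processed_limit @type = N'{period}', @limit_tb = {int(tb)};"
        for period, tb in limits.items()
    ]


# 2 TB per day as in the article; weekly and monthly caps are illustrative.
statements = limit_statements(daily_tb=2, weekly_tb=10, monthly_tb=30)
for statement in statements:
    print(statement)
```

Keeping the limits in a script makes them easy to review and to reapply when a new workspace is created.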

2. Avoid row-by-row updates in a dedicated SQL pool.

When you use a Dedicated SQL Pool as your core data warehouse, your coding habits must shift completely from "small workshop" thinking to "distributed" thinking.

The inside story: in traditional relational databases (such as SQL Server or MySQL), we often write UPDATE my_table SET status = 1 WHERE id = 123;. In Synapse's distributed architecture, however, the data is spread across 60 distributions. If you fire single-row UPDATE or INSERT statements in a loop from application code or an ETL process, the control node that coordinates the cluster gets bogged down in constant locking and network synchronization, and throughput can be as much as 100 times slower than on a single-machine database.

The hardened rule: always load in bulk. If you need to update data, use PolyBase or the COPY command to land tens of thousands of new rows in a staging table, then overwrite or merge in bulk with a single clean, set-oriented statement. Write code that suits the way a distributed cluster works, and it will reward you with genuinely fast responses.
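The staging-table pattern can be tried locally before touching a real pool. The sketch below uses Python's built-in `sqlite3` as a stand-in, which is an assumption made purely so the example runs anywhere: in Synapse the load step would be COPY or PolyBase, and the exact UPDATE or MERGE syntax may differ. The `staging` table name and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, 0)", [(i,) for i in range(1, 1001)])

# 1. Land the changed rows in a staging table (stands in for COPY / PolyBase).
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, status INTEGER)")
changes = [(i, 1) for i in range(1, 1001, 2)]  # 500 hypothetical updates
conn.executemany("INSERT INTO staging VALUES (?, ?)", changes)

# 2. Apply every change with one set-oriented statement
#    instead of 500 separate single-row UPDATEs.
conn.execute(
    """
    UPDATE my_table
    SET status = (SELECT s.status FROM staging AS s WHERE s.id = my_table.id)
    WHERE id IN (SELECT id FROM staging)
    """
)
conn.commit()

updated = conn.execute("SELECT COUNT(*) FROM my_table WHERE status = 1").fetchone()[0]
print(updated)
```

The point of the design is that the warehouse sees two large operations (one load, one update) rather than hundreds of tiny ones, which is the shape of work a distributed engine handles best.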

Conclusion

Quickly setting up an enterprise-class modern data warehouse with Azure Synapse Analytics comes down to four principles:

Separate the compute, query files in place, cap the total scanned, load in bulk.

You can say goodbye to the old misery of begging one team after another for data exports from different systems, worrying that a big report will freeze the production system, and losing hair every day over virtual machine memory. The heaviest computing workloads are handed off to the fully managed, distributed MPP cloud platform that Microsoft has invested tens of billions of dollars in building. Sitting at your computer, calmly opening a clean data mart and watching hundreds of millions of rows fall into line in the blink of an eye: that is how an architect in the modern data era works with elegance.
