How to create a merge table in BigQuery

2 min readMar 24, 2024

In this post, I introduce the merge function in BigQuery.
The use cases where we utilize it are mainly to eliminate data duplication. It is a powerful tool that allows us to manage and maintain data integrity. By merging data efficiently, we can ensure our datasets are free from redundancies.

Merge Query

Basic Query

MERGE dataset.master M
USING dataset.temp T
ON
  M.id = T.id
WHEN MATCHED THEN
  UPDATE SET
    M.hoge = T.hoge,
    M.fuga = T.fuga
WHEN NOT MATCHED THEN
  INSERT ROW

With partition tables

if the table we merge is partitioned, we can’t use 「INSERT ROW」statement. If you use it, you can see the following error message.

「if the target is ingestion-time partitioned table, this won’t work you have to specify the columns:」

MERGE dataset.master M
USING dataset.temp T
ON
  M.id = T.id
WHEN MATCHED THEN
  UPDATE SET
    M.hoge = T.hoge,
    M.fuga = T.fuga
WHEN NOT MATCHED THEN
  INSERT(
    hoge,
    fuga
    )
  VALUES(
    hoge,
    fuga
  )

With One to many tables

If the relationship between the master and temp tables is one-to-many, we can use the composite key.

MERGE dataset.master M
USING dataset.temp T
ON
  M.id = T.id
AND
  M.status = T.status
WHEN MATCHED THEN
  UPDATE SET
    M.hoge = T.hoge,
    M.fuga = T.fuga
WHEN NOT MATCHED THEN
  INSERT ROW

Data manipulation language (DML) statements in GoogleSQL | BigQuery | Google Cloud

If you use on-demand billing, BigQuery charges for data manipulation language (DML) statements based on the number of…

cloud.google.com

How to improve the MERGE efficiency

We have three options to improve the MEREGE efficiency.

subquery filter

MERGE dataset.master M
USING (SELECT * FROM dataset.temp WHERE _PARTITIONTIME = '2018-01-01') T
ON M.id = T.id
WHEN MATCHED THEN
  DELETE

`search_condition`

MERGE dataset.master M
USING dataset.temp T
ON M.id = T.id
WHEN MATCHED AND M._PARTITIONTIME = '2018-01-01' THEN
  UPDATE SET id = S.id
WHEN MATCHED AND M._PARTITIONTIME = '2018-01-02' THEN
  UPDATE SET price = S.price + 10

`merge_condition`

MERGE dataset.master M
USING dataset.temp T
ON M.id = T.id AND
  M._PARTITIONTIME = '2018-01-01' AND T._PARTITIONTIME = '2018-01-01'
WHEN MATCHED THEN
  UPDATE SET M.status= T.status

How to reduce the cost of BigQuery incremental data ingestion pipelines | Google Cloud Blog

Improve performance of BigQuery incremental data ingestion pipelines by pruning more data during merge step

cloud.google.com

Reference

DWH上のマスターテーブルを重複なく効率的に更新する

本ページでは、DWHのテーブルに対し、「データの重複なく」「効率的に」変更を反映する方法をご紹介します。転送先Google BigQueryには、 UPSERTや MERGEといった転送モードがありません。…

documents.trocco.io