Unexpected spikes in your data storage or processing costs can be alarming, and they often result from inefficient data management, outdated retention policies, or sudden increases in query activity. Left unchecked, these spikes can lead to significant budget overruns, so it’s crucial to act swiftly. A common culprit is data that holds little immediate value but cannot be discarded due to compliance, legal, or potential future requirements: it occupies valuable storage space and can incur substantial costs.
Understand the root causes
Let’s explore how to investigate and optimize your Google Cloud Platform (GCP) costs, focusing on analyzing billing reports and reducing BigQuery storage spend. Say you’ve noticed a significant increase in your GCP costs after setting up budgets and billing alerts. Go to Billing > Reports and ask Gemini to help you find the source of the expenses. Gemini will break the spend down by service and SKU; in this example, the main contributor turned out to be BigQuery Long Term Storage fees.
To identify the projects impacted by Long Term Storage fees, filter the billing report by SKU. Once you’ve applied this filter, you’ll see a list of all projects incurring these costs. You’ll need to address each project individually to clean up and reduce or eliminate Long Term Storage fees. For this example, I’ll assume you only have a single project affected.
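If you also export billing data to BigQuery, you can reproduce the same SKU-level breakdown with a query. This is only a sketch: the export table name below is a placeholder for your own billing export table, and the SKU filter and date cutoff are assumptions.
SELECT
  project.id AS project_id,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM
  -- Placeholder: replace with your actual billing export table
  `my_project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE
  sku.description LIKE '%Long Term%'
  AND usage_start_time >= TIMESTAMP('2025-01-01')
GROUP BY
  project_id, sku
ORDER BY
  total_cost DESC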
Google Cloud Storage as an alternative
Next, navigate to Google Cloud Storage and create a new GCS bucket. This bucket will serve as the repository for your unused data. Select a region that aligns with your data access needs and compliance requirements, considering factors like proximity to users, data sovereignty regulations, and network latency.
GCS offers various storage classes, each with distinct cost and performance characteristics. For infrequently accessed data, opt for storage classes like Coldline Storage or Archive Storage, which offer lower costs but may have higher retrieval fees and latency. If you anticipate occasional access, Nearline Storage or Standard Storage might be more suitable. I used a GCS bucket with the archive storage class and called it g_ads_history_backup.
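For reference, such a bucket can be created from Cloud Shell with gsutil; the location below is only an assumption, so pick the region that matches your own requirements:
# Create an Archive-class bucket for the cold data
gsutil mb -l us-central1 -c archive gs://g_ads_history_backup/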
BigQuery SQL scripts and bash commands
To identify BigQuery tables with a high potential for storage gain, you can filter them based on the difference between their total physical bytes and total logical bytes. You can also filter tables based on their last accessed or modified dates to identify tables that are no longer actively used. To estimate the initial data size, run the following query for the dataset with the highest storage cost:
-- Total size (in GB) of tables not modified since the cutoff date
SELECT
  ROUND(SUM(size_bytes) / 1024 / 1024 / 1024) AS table_size_gb
FROM
  my_dataset.__TABLES__
WHERE
  DATE(TIMESTAMP_MILLIS(last_modified_time)) < '2025-01-01'
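If you also want to weigh the physical-versus-logical difference mentioned above, the INFORMATION_SCHEMA.TABLE_STORAGE view exposes both numbers. The sketch below assumes your data lives in the us multi-region; adjust the region qualifier to match your dataset:
SELECT
  table_name,
  ROUND(total_logical_bytes / 1024 / 1024 / 1024, 1) AS logical_gb,
  ROUND(total_physical_bytes / 1024 / 1024 / 1024, 1) AS physical_gb
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  table_schema = 'my_dataset'
ORDER BY
  total_logical_bytes - total_physical_bytes DESC
For the rest of this walkthrough, I’ll stick with the simpler last-modified filter from the query above.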
Adjust the date or add other conditions to filter unused tables. In my case, the total size of these tables was 748 GB. Next, list all of the unmodified tables with their sizes and generate an export command for each:
DECLARE gcs_path STRING DEFAULT 'gs://g_ads_history_backup/';
DECLARE bq_command STRING DEFAULT 'bq extract --destination_format PARQUET --compression GZIP';

SELECT
  table_id,
  -- Tables of roughly 1 GB or more need a wildcard destination so the export can be split into multiple files
  IF(
    ROUND(size_bytes / 1024 / 1024 / 1024) > 0,
    CONCAT(bq_command, ' my_dataset.', table_id, ' ', gcs_path, table_id, '/*'),
    CONCAT(bq_command, ' my_dataset.', table_id, ' ', gcs_path, table_id)
  ) AS bash_command,
  ROUND(size_bytes / 1024 / 1024 / 1024) AS table_size_gb
FROM
  my_dataset.__TABLES__
WHERE
  size_bytes > 0
  AND DATE(TIMESTAMP_MILLIS(last_modified_time)) < '2025-01-01'
ORDER BY
  size_bytes DESC
This query does more than list tables: it also generates bq CLI commands that automate and streamline the transfer of data from BigQuery to Google Cloud Storage (GCS), reducing manual effort and the chance of copy-paste errors.
The data will be exported as GZIP-compressed Parquet files. The commands also add a wildcard to the destination URI for tables larger than about 1 GB, since a single exported file cannot exceed 1 GB; the wildcard lets bq extract split the output into multiple objects. Upon execution, the query returned 86 tables; a sample of the generated commands is shown below.
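For a table hypothetically named campaign_stats_2021, the generated command looks like this:
bq extract --destination_format PARQUET --compression GZIP my_dataset.campaign_stats_2021 gs://g_ads_history_backup/campaign_stats_2021/*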
You can keep a record of these query results for future reference and analysis by storing them in a dedicated BigQuery table designed for logging purposes. In this scenario, I used a table called my_dataset.tables_cleanup.
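One way to persist the results is a CREATE TABLE ... AS SELECT over the same metadata. The sketch below always uses the wildcard destination for brevity, which also works for tables smaller than 1 GB:
-- Log the export commands so they can be re-run and reused for the truncation step
CREATE OR REPLACE TABLE my_dataset.tables_cleanup AS
SELECT
  table_id,
  CONCAT('bq extract --destination_format PARQUET --compression GZIP my_dataset.',
         table_id, ' gs://g_ads_history_backup/', table_id, '/*') AS bash_command,
  ROUND(size_bytes / 1024 / 1024 / 1024) AS table_size_gb
FROM
  my_dataset.__TABLES__
WHERE
  size_bytes > 0
  AND DATE(TIMESTAMP_MILLIS(last_modified_time)) < '2025-01-01'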
Now execute the generated bash command for each table. You can loop over the commands in bash for efficiency or run them one at a time; in my experience, processing all 86 tables took approximately 15 minutes.
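If you go the loop route, one possible approach is sketched below (a workflow assumption, not a prescribed method): it dumps the bash_command column with the bq CLI, skips the CSV header, and runs each command in turn.
#!/bin/bash
# Export the generated bq extract commands from the logging table
bq query --nouse_legacy_sql --format=csv --max_rows=1000 \
  'SELECT bash_command FROM my_dataset.tables_cleanup ORDER BY table_size_gb DESC' \
  | tail -n +2 > extract_commands.sh

# Run each extract command, echoing it first for traceability
while IFS= read -r cmd; do
  echo "Running: ${cmd}"
  eval "${cmd}"
done < extract_commands.sh
Once all the commands have been executed successfully, truncate every table listed in the my_dataset.tables_cleanup table. To accomplish this, use the following script: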
-- Truncate every table that was exported to GCS
FOR record IN (
  SELECT table_id FROM my_dataset.tables_cleanup
) DO
  EXECUTE IMMEDIATE CONCAT('TRUNCATE TABLE my_dataset.', record.table_id);
END FOR;
I used table truncation to preserve the table schemas for future use, so the structure remains intact if you ever need to repopulate or reference a table later. If preserving the schema is not a concern, dropping the tables outright is just as effective. My 86 tables were truncated in approximately 17 minutes. As a final check, run the following command in Cloud Shell to get the size of the compressed data:
gsutil du -s -a gs://g_ads_history_backup/
I got 33 GB in GCS, roughly 22x less than the initial 748 GB in BigQuery. Note that you’ll also have to wait for the time travel window (seven days by default) to expire before the truncated data leaves time travel storage.
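To watch the time travel storage drain, you can query INFORMATION_SCHEMA.TABLE_STORAGE again; as before, the region qualifier is an assumption you should adjust to your dataset’s location:
SELECT
  table_name,
  ROUND(time_travel_physical_bytes / 1024 / 1024 / 1024, 2) AS time_travel_gb
FROM
  `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE
  table_schema = 'my_dataset'
  AND time_travel_physical_bytes > 0
ORDER BY
  time_travel_physical_bytes DESC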
Conclusion
By following a structured approach—analyzing billing reports, identifying costly storage patterns, and transferring infrequently accessed data to cost-efficient Google Cloud Storage—you can significantly lower your GCP expenses. Addressing long-term storage fees and optimizing BigQuery table usage helps minimize waste while maintaining accessibility and compliance. Leveraging automation and strategic data management ensures continued cost savings and efficiency.