JSON vs Structs vs Columns in BigQuery - Observations

Shout out to Josh Peng and Claus Herther, who inspired me to write this post!



A common way of logging and delivering data from production systems is the JSON format. JSON allows for a flexible schema that supports nested key/value pairs and arrays. In BigQuery, we have a few options to consider when choosing how to store this data:


  1. JSON string column with BigQuery JSON functions

    1. Pros: 

      1. Easiest to use directly from the source system

      2. Flexible schema as source data changes

    2. Cons:

      1. We lose BigQuery’s columnar data storage benefits. Every query will pay the cost of hitting the entire JSON object.

      2. JSON functions are somewhat limited

  2. Transform the JSON into structs

    1. Pros:

      1. Depending on your use case, this may better fit your data needs, especially if you have nested arrays in your JSON

      2. Efficient use of columnar data storage

    2. Cons:

      1. Lack of schema flexibility

      2. May be new territory for SQL users who are not used to working with nested fields

  3. Transform the JSON into a standard table with columns

    1. Pros:

      1. Easiest to use for most SQL users

      2. Efficient use of columnar data storage

    2. Cons:

      1. Lack of schema flexibility

      2. Additional storage when repeated data is flattened into duplicated rows


Additionally, we will make use of the recently released json_extract_array function, which makes it much easier to work with the arrays nested in our source JSON.

Setup for this demonstration

This post assumes you are already familiar with BigQuery and Colab. If not, please check out our other post, Google BigQuery and Colab - Getting Started, and come back when ready.


Our example source JSON data is a list of United States state populations from the datausa.io API.


Source data JSON sample:


{
  "data": [
      {
          "ID State": "04000US39",
          "State": "Ohio",
          "ID Year": 2018,
          "Year": "2018",
          "Population": 11689442,
          "Slug State": "ohio"
      },
      {
          "ID State": "04000US01",
          "State": "Alabama",
          "ID Year": 2018,
          "Year": "2018",
          "Population": 4887871,
          "Slug State": "alabama"
      }
  ]
}


As we can see from this JSON sample, we have a JSON object that contains a “data” key whose value is an array of JSON objects (one per state) containing various other key/value pairs we would like to query!


Use this Colab Notebook (BigQuery, JSON, Structs, Arrays, Nested, Repeated Observations) to follow along with the rest of this post. 
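
Throughout this post we will call three small helper functions (run_query, table_size_check, and query_size_check) that are defined in the notebook. If you want to follow along outside of Colab, here is a minimal sketch of what they might look like using the google-cloud-bigquery client; the dataset name and exact implementations are assumptions, and the notebook has the real versions:


from google.cloud import bigquery

# Minimal sketch of the notebook's helpers; the dataset name is a placeholder.
DATASET = "your-project.your_dataset"  # assumption: replace with your own project/dataset
client = bigquery.Client(
    default_query_job_config=bigquery.QueryJobConfig(default_dataset=DATASET)
)


def run_query(sql):
    """Run a query and block until it completes."""
    job = client.query(sql)
    job.result()
    return job


def table_size_check(table_name):
    """Return the logical storage size of a table, in bytes."""
    return client.get_table(f"{DATASET}.{table_name}").num_bytes


def query_size_check(sql):
    """Dry-run a query and return the number of bytes it would process."""
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed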

JSON string column with BigQuery JSON functions

Given our source is JSON, we can create a basic BigQuery table that stores the raw JSON as a single string column.


For this example, we manually copy/paste the JSON from https://datausa.io/api/data?drilldowns=State&measures=Population to create our base JSON table:


query_job = run_query("""create or replace table us_state_populations as
select
'{"data":[{"ID State":"04000US39","State":"Ohio","ID Year":2018,"Year":"2018","Population":11689442,"Slug State":"ohio"},{"ID State":"04000US72","State":"Puerto Rico","ID Year":2018,"Year":"2018","Population":3195153,"Slug State":"puerto-rico"},{"ID State":"04000US01","State":...}'
as us_state_population_json""")


*Actual code in notebook*
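
Instead of copy/pasting, you could also fetch the JSON from the API and load it into the table programmatically. The sketch below is an alternative of our own (it uses the requests library and a placeholder table ID), not the approach used in the notebook:


import requests
from google.cloud import bigquery

# Sketch only: fetch the raw JSON and load it as a single string row.
client = bigquery.Client()
TABLE_ID = "your-project.your_dataset.us_state_populations"  # placeholder

raw_json = requests.get(
    "https://datausa.io/api/data?drilldowns=State&measures=Population"
).text

load_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("us_state_population_json", "STRING")],
    write_disposition="WRITE_TRUNCATE",  # replace the table contents on each run
)
client.load_table_from_json(
    [{"us_state_population_json": raw_json}],
    TABLE_ID,
    job_config=load_config,
).result()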


Let's check the size of our table in bytes. This example is quite small, only about 37 KB, so we don't have to worry about our BigQuery budget.


table = 'us_state_populations'
table_size = table_size_check(table)
print(f'{table} is {table_size} bytes.')

us_state_populations is 37725 bytes.


Now, let's use BigQuery JSON functions to extract relevant information from this table:


query = """
  select
    json_extract_scalar(b,'$.ID State') as state_id,
    json_extract_scalar(b,'$.State') as state_name,
    cast(json_extract_scalar(b,'$.Population') as int64) as state_population
  from us_state_populations a, unnest(json_extract_array(a.us_state_population_json,'$.data')) as b
"""

# How many bytes will this query process?
print('This query will process {} bytes.'.format(query_size_check(query)))

This query will process 37725 bytes.


As we can see, this query processes the full size of the table (i.e., the entire JSON, 37,725 bytes) and returns the results we expected. Now that we have this data in rows, we can use it just like the output of any other query.
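
For example, assuming run_query returns a finished QueryJob (as in the helper sketch earlier) and pandas is available, as it is in Colab, we can pull the rows into a DataFrame:


# Run the JSON-extraction query above and load the rows into a pandas DataFrame.
df = run_query(query).to_dataframe()
df.sort_values('state_population', ascending=False).head()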


The Takeaway: Running queries directly against the JSON is flexible, but it can get costly. As we saw above, we processed the entire JSON object (37,725 bytes) even though we only accessed three columns. As we will see next, there are opportunities to store this data differently to take advantage of BigQuery's columnar storage and query optimization.

Transform the JSON into structs

Leverage BigQuery's support for Struct / Array data types. In doing so, we will gain the advantage of BigQuery's columnar storage optimizations and other enhancements.


For this example, we will use the JSON from our base table to create a new table using a struct to store this same data:


query_job = run_query("""
  create or replace table us_state_populations_struct as
  select
      struct(
        json_extract_scalar(b,'$.ID State') as state_id,
        json_extract_scalar(b,'$.State') as state_name,
        cast(json_extract_scalar(b,'$.ID Year') as int64) as state_id_year,
        cast(json_extract_scalar(b,'$.Year') as int64) as state_year,
        cast(json_extract_scalar(b,'$.Population') as int64) as state_population,
        json_extract_scalar(b,'$.Slug State') as state_slug
        ) as state
  FROM
    us_state_populations a, unnest(json_extract_array(a.us_state_population_json,'$.data')) as b
""")


*Note our use of the new json_extract_array function here!*

json_extract_array allows us to turn our JSON array into a BigQuery array, to which we can apply the unnest function to get a row for each record of the array. Very handy!
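
As a quick, standalone illustration (using a made-up JSON literal rather than our source data), here is the pattern in isolation:


query = """
  select
    json_extract_scalar(item, '$.name') as name
  from unnest(json_extract_array('{"data":[{"name":"a"},{"name":"b"}]}', '$.data')) as item
"""

# The result has two rows: 'a' and 'b'.
run_query(query)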


Check the size of our table in bytes:


table = 'us_state_populations_struct'
table_size = table_size_check(table)
print(f'{table} is {table_size} bytes.')

us_state_populations_struct is 17604 bytes.


As we can see, we immediately get a table storage size savings: instead of paying to store 37,725 bytes of data, we are paying to store 17,604 bytes! Obviously, for a table this small the savings don't really matter, but for a much larger table they could add up quickly.


Now, let's run a similar query to what we did above, against our new struct table:


query = """
  select
    state.state_id,
    state.state_name,
    state.state_population
  from us_state_populations_struct a
"""

# How many bytes will this query process?
print('This query will process {} bytes.'.format(query_size_check(query)))

This query will process 9270 bytes.


As we can see, we again get significant savings in the number of bytes scanned: instead of scanning 37,725 bytes, we scanned 9,270 bytes, roughly a quarter of the bytes scanned by our original JSON query! This is because BigQuery only needs to read the columns we reference from the underlying columnar storage. Additionally, the columns are easier to access, since we do not have to use the JSON functions.


The Takeaway: Using structs saves us both storage and query bytes, but we lose the flexibility of the JSON schema: now we have to change our struct table transform whenever the underlying JSON changes.
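
For example, if the source API ever started returning a hypothetical new key (say, "Median Age"), we would have to edit the transform and rebuild the table before anyone could query the new field. A sketch of that maintenance step, with the hypothetical field marked:


query_job = run_query("""
  create or replace table us_state_populations_struct as
  select
      struct(
        json_extract_scalar(b,'$.ID State') as state_id,
        json_extract_scalar(b,'$.State') as state_name,
        cast(json_extract_scalar(b,'$.ID Year') as int64) as state_id_year,
        cast(json_extract_scalar(b,'$.Year') as int64) as state_year,
        cast(json_extract_scalar(b,'$.Population') as int64) as state_population,
        json_extract_scalar(b,'$.Slug State') as state_slug,
        -- hypothetical new field; null until the source JSON actually includes it
        cast(json_extract_scalar(b,'$.Median Age') as float64) as state_median_age
        ) as state
  FROM
    us_state_populations a, unnest(json_extract_array(a.us_state_population_json,'$.data')) as b
""")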


Now let's take a look at creating standard table columns with this data.

Transform the JSON into a standard table with columns

Leverage BigQuery's standard column data types. In doing so, we will gain the advantage of BigQuery's columnar storage optimizations and other enhancements. This will also be the easiest for most SQL users.


For this example, we will use the JSON from our base table to create a new table using standard columns to store this same data:


query_job = run_query("""
  create or replace table us_state_populations_columns as
  select
    json_extract_scalar(b,'$.ID State') as state_id,
    json_extract_scalar(b,'$.State') as state_name,
    cast(json_extract_scalar(b,'$.ID Year') as int64) as state_id_year,
    cast(json_extract_scalar(b,'$.Year') as int64) as state_year,
    cast(json_extract_scalar(b,'$.Population') as int64) as state_population,
    json_extract_scalar(b,'$.Slug State') as state_slug
  FROM
    us_state_populations a, unnest(json_extract_array(a.us_state_population_json,'$.data')) as b
""")


Check the size of our table in bytes:


table = 'us_state_populations_columns'
table_size = table_size_check(table)
print(f'{table} is {table_size} bytes.')

us_state_populations_columns is 17604 bytes.


As we can see, we get the same storage savings as the struct approach: instead of paying to store 37,725 bytes of data, we are paying to store 17,604 bytes! Again, for a table this small the savings don't really matter, but for a much larger table they could add up quickly.


Now, let's run a similar query to what we did above, against our new table:


query = """
  select
    state_id,
    state_name,
    state_population
  from us_state_populations_columns a
"""

# How many bytes will this query process?
print('This query will process {} bytes.'.format(query_size_check(query)))

This query will process 9270 bytes.


As we can see, these results match our struct table findings: instead of scanning 37,725 bytes, we scanned 9,270 bytes, roughly a quarter of the bytes scanned by our original JSON query! Once again, BigQuery only needs to read the columns we reference from the underlying columnar storage, and the columns are easier to access, since we do not have to use the JSON functions.


The Takeaway: Using standard columns saves us both storage and query bytes, the same as our struct approach, but we lose the flexibility of the JSON schema: we have to change our table transform whenever the underlying JSON changes. This table will likely be the easiest to use for anyone coming to BigQuery from SQL in other data warehouses.

The final word...


Our recommendation is to use JSON strings primarily for base tables, or what you might consider underlying data-lake-style tables. This way you can take advantage of the flexible schema JSON provides. For analytics and query use, you should transform this JSON data into structs or columns. Another option is to transform the JSON into structs or columns via stream processing before it lands in BigQuery, but that approach has other factors to consider.


There are additional benefits to using standard columns or structs in the BigQuery Web UI. Tables are easier to preview and explore compared to JSON strings.


Keep an eye out for our next post which will look into further ways to optimize this data using a combination of arrays and structs.

