Junk Dimension practice in Big Data

Sumi
2 min read · Oct 24, 2021
Photo by Keith Pitts at keithmelissa.com; Instagram @keithmpitts.

What’s a junk dimension?

A junk dimension, as a data warehousing technique, is a convenient grouping of typically low-cardinality flags and indicators. Creating this abstract dimension removes the flags and indicators from the fact table and places them in a useful dimensional framework.

Imagine you have a user fact table that, in very simplified terms, has only two fields, gender and married:

You can restructure it so that the fact table references one dimension table:
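As a rough illustration of that restructuring, here is a minimal sketch in plain Python. The sample rows, the `junk_id` surrogate key, and the names `user_fact`, `junk_dim`, and `user_fact_slim` are all made up for this example; they are not taken from a real schema.

```python
from itertools import product

# Hypothetical sample rows for the simplified user fact table.
user_fact = [
    {"user_id": 1, "gender": "F", "married": True},
    {"user_id": 2, "gender": "M", "married": False},
    {"user_id": 3, "gender": "F", "married": True},
    {"user_id": 4, "gender": "M", "married": True},
]

# Junk dimension: one entry per (gender, married) combination,
# mapped to an assumed surrogate key called junk_id.
genders = ["F", "M"]
married_flags = [False, True]
junk_dim = {
    combo: junk_id
    for junk_id, combo in enumerate(product(genders, married_flags), start=1)
}

# Restructured fact table: the two flags collapse into a single junk_id.
user_fact_slim = [
    {"user_id": row["user_id"], "junk_id": junk_dim[(row["gender"], row["married"])]}
    for row in user_fact
]

for row in user_fact_slim:
    print(row)
```

The fact table now carries one small key per user, and the meaning of that key lives in a dimension table with only four rows.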

What’s the value of a junk dimension?

  1. Ease of managing metadata together: if your fact table has 10 dimensions, you may need 10 dimension tables to store the metadata. With a junk dimension you manage one table instead. This is most beneficial when your dimensions are very domain specific and not widely shared with other fact tables, or when a few fact tables share the same junk dimension.
  2. Storage efficiency: in the example above, if we have 1 billion users to store and the dimensions have only a handful of combinations, the space saving is huge. This is most beneficial when your dimensions are low cardinality.
  3. Computation efficiency: one application of data warehousing is reporting, and reporting needs to compute aggregations at the lowest-granularity dimension combination. A junk dimension reduces the shuffle (for each dimension), sort, merge sequence to just sort and merge.
  4. Data quality control: you can encode business logic in the junk dimension and plug in a data quality control framework. For example, if the junk dimension has both age_group and account_type fields, an account in the <13 years age_group should not have a parent account_type.
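The data quality rule in the last point can be sketched as a simple check over the junk dimension rows. This is only an illustration: the rows, the `child` and `18+` values, and the `is_valid` helper are hypothetical, and a real setup would likely express the rule in whatever data quality framework is in use.

```python
# Hypothetical junk dimension rows; only age_group and account_type
# come from the example in the text, the values are made up.
junk_dim = [
    {"junk_id": 1, "age_group": "<13", "account_type": "child"},
    {"junk_id": 2, "age_group": "<13", "account_type": "parent"},
    {"junk_id": 3, "age_group": "18+", "account_type": "parent"},
]


def is_valid(row):
    """Rule from the text: a <13 age_group account cannot be a parent account."""
    return not (row["age_group"] == "<13" and row["account_type"] == "parent")


# Collect the combinations that violate the rule so they can be flagged.
invalid_ids = [row["junk_id"] for row in junk_dim if not is_valid(row)]
print(invalid_ids)
```

Because the rule runs against the small dimension table rather than the fact table, it only has to inspect each combination once.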

Big Data Junk Dimension Considerations

  • Cardinality: many dimensions, each with low cardinality, is the ideal scenario for a junk dimension. As a rule of thumb, the junk dimension table should fit in memory so that a map-side broadcast join can avoid the shuffle.
  • Metadata evolution: in the real world, metadata is always evolving: a new gender value is added, a new app version is released. It’s recommended to regenerate your junk dimension every time an underlying dimension value changes. To support backfilling history, the table needs to keep the full history of the dimension space, with effective start and end dates on every record.
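To make the effective-date idea concrete, here is a small sketch of a junk dimension that keeps history and is looked up as of a given date. Everything in it is an assumption for illustration: the column names `effective_start` and `effective_end`, the sample versions and dates, and the `lookup_junk_id` helper. The in-memory list stands in for a table small enough to be broadcast to every worker.

```python
from datetime import date

# Hypothetical junk dimension history. effective_end=None means the
# combination is still current; a date means it was retired on that day.
junk_dim_history = [
    {"junk_id": 1, "gender": "F", "app_version": "1.0",
     "effective_start": date(2021, 1, 1), "effective_end": date(2021, 9, 30)},
    {"junk_id": 2, "gender": "M", "app_version": "1.0",
     "effective_start": date(2021, 1, 1), "effective_end": date(2021, 9, 30)},
    {"junk_id": 3, "gender": "F", "app_version": "2.0",
     "effective_start": date(2021, 6, 1), "effective_end": None},
    {"junk_id": 4, "gender": "M", "app_version": "2.0",
     "effective_start": date(2021, 6, 1), "effective_end": None},
]


def lookup_junk_id(gender, app_version, as_of):
    """Return the junk_id valid for this combination on the given date, or None."""
    for row in junk_dim_history:
        if (row["gender"], row["app_version"]) != (gender, app_version):
            continue
        started = row["effective_start"] <= as_of
        not_ended = row["effective_end"] is None or as_of <= row["effective_end"]
        if started and not_ended:
            return row["junk_id"]
    return None


# Backfilling a fact row from March 2021 still resolves the retired 1.0 combination.
print(lookup_junk_id("F", "1.0", date(2021, 3, 1)))
```

Keeping retired combinations in the table, rather than deleting them, is what lets a backfill for an old date resolve to the same key it would have received at the time.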
