I’ve been in this industry for quite a while and have seen a lot of trends and approaches come and go in that time. Some are still with us, some have fallen out of favour just as quickly as they became fashionable in the first place.
A framework for implementing a successful BI solution
I have tended to build Data Warehouses almost exclusively using the Kimball Methodology. For me it’s a framework for implementing a successful Business Intelligence solution, and while on its own it can’t guarantee success, with good business requirements and a good design and implementation team it can get your organisation a long way down that road.
The Kimball Methodology was developed by Ralph Kimball, who came out of that hot-house of innovation in the early ’70s, the Xerox Palo Alto Research Center (PARC). This is the institution that gave the world the laser printer, one of the first computers to use a mouse, and the graphical user interface complete with icons.
The Data Warehouse methodology he came up with is essentially a “bottom-up” approach, which means that you can get going without understanding all the requirements from across the business by building a single subject area data mart. This is also called a star-schema database because of its appearance when drawn: a single large fact table in the middle, containing the measurements you want to analyse, surrounded by linked tables called dimensions, containing the descriptive context (such as date, product or customer) by which you analyse them. A more formal data warehouse can be constructed over time as further subject area data marts are included and re-use the same dimensions, which have a common meaning throughout the enterprise. These are called conformed dimensions, and the grid that maps each fact table to the conformed dimensions it uses, showing where those intersections occur, is called the bus matrix.
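To make the star schema and the bus matrix a little more concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The retail scenario and every table and column name (fact_sales, fact_inventory, dim_date, dim_product) are my own illustrative assumptions rather than anything prescribed by Kimball. It shows one subject area being built first and a second one added later that re-uses the same dimensions.

```python
import sqlite3

# Hypothetical retail example: all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: the descriptive context for the measurements.
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    -- First subject area: a sales fact table holding the measures.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        sales_amount REAL
    );
    -- Second subject area, added later, re-using the same dimensions.
    CREATE TABLE fact_inventory (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        units_on_hand INTEGER
    );
""")

conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)", [
    (20240101, "2024-01-01", "2024-01"),
    (20240102, "2024-01-02", "2024-01"),
])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Widget", "Hardware"),
    (2, "Gadget", "Hardware"),
    (3, "Manual", "Books"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (20240101, 1, 2, 20.0),
    (20240101, 3, 1, 15.0),
    (20240102, 2, 1, 30.0),
])
conn.executemany("INSERT INTO fact_inventory VALUES (?, ?, ?)", [
    (20240102, 1, 8),
    (20240102, 2, 4),
    (20240102, 3, 9),
])

# A typical star-schema query: sum a measure from the fact table,
# grouped by attributes that live in the dimensions.
sales_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category
""").fetchall()

# A toy bus matrix: which dimensions each fact table uses.
bus_matrix = {
    "fact_sales": {"dim_date", "dim_product"},
    "fact_inventory": {"dim_date", "dim_product"},
}
# In this sketch, the dimensions shared by every fact table are the conformed ones.
conformed_dimensions = set.intersection(*bus_matrix.values())
```

Because both fact tables point at the same dim_product and dim_date rows, a report can compare sales and stock levels by category or month without any reconciliation between the two subject areas, which is the practical payoff of conformed dimensions.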
The advantages are that you can get going without needing to know the full picture and add functionality iteratively over time without needing a big-bang approach. Where these subject areas or fact tables can be linked by common business dimensions, it is possible to get an integrated, enterprise-wide view through reporting from the data presented in this manner.
There have been other approaches over the years. When I first started, the great debate was whether to use Kimball’s or Inmon’s approach. Bill Inmon developed a “top-down” approach, which differed in that you had to know and understand all your business requirements and processes up front so that an enterprise data model could be developed first. The solution typically used a third normal form database, and subject area data marts could then be derived from the data warehouse as a whole. Typically, implementations using Inmon’s methods took much longer and cost more, so businesses traditionally saw a slower return on investment by taking this path. In addition, due to the big-bang / model-the-world approach usually favoured, more projects tended to fail than succeed.
In the modern era, data is far more readily available than at any other point in history. It can come from all sorts of sources, both internal and external, and arrives far more frequently than it used to. This has led to new techniques and methods for handling big data, such as the Data Lake. The best definition that I’ve read for the data lake is by James Dixon, a founder of Pentaho, who said that “If you think of a data mart as a store of bottled water – cleansed, packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and various users can come to examine, dive in or take samples.”
I prefer to think of these approaches as Lego sets. A Kimball-designed data warehouse or data mart is like one of those sets of Lego which comes with a picture on the box and a set of instructions (the framework) for getting the data you need to make a decision. A data lake is more like a giant tub of assorted Lego bricks with no defined plan as to how to put them together, and some of the bricks will be non-standard. It’s up to the person playing with the bricks to assemble them how they see fit to meet their needs. This is the transform-on-request model that users of the data lake need to adopt: the data they need could still be in a raw, unstructured form and will need to be transformed and combined with corporate data in order to make a business decision.
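As a rough illustration of the transform-on-request idea, here is a minimal Python sketch. Everything in it is a hypothetical assumption for the sake of the example: the raw records, the field names (sku, mentions) and the product lookup standing in for structured corporate data. The point is only that the raw data is stored as it arrived and gets cleaned and shaped at the moment somebody asks a question of it.

```python
import json

# Hypothetical raw records as they might land in a data lake:
# inconsistent key names, mixed types, and some unusable lines.
raw_lake_records = [
    '{"sku": "W-1", "mentions": "12", "source": "social"}',
    '{"SKU": "G-2", "mentions": 5, "source": "reviews"}',
    'not valid json',
    '{"sku": "W-1", "mentions": 3}',
]

# Structured corporate data, e.g. a product dimension from the warehouse.
product_dimension = {"W-1": "Widget", "G-2": "Gadget"}


def mentions_by_product(raw_records, products):
    """Transform raw records on request and combine them with corporate data."""
    totals = {}
    for line in raw_records:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip the non-standard bricks we cannot parse
        if not isinstance(record, dict):
            continue
        # Reconcile inconsistent key names at read time.
        sku = record.get("sku") or record.get("SKU")
        if sku not in products:
            continue  # no matching corporate record to combine with
        name = products[sku]
        # Coerce mixed types (string or integer counts) into integers.
        totals[name] = totals.get(name, 0) + int(record.get("mentions", 0))
    return totals


result = mentions_by_product(raw_lake_records, product_dimension)
```

The design choice worth noticing is that nothing was cleansed before it was stored. Each consumer applies their own interpretation when they read the data, which is flexible but pushes the effort (and the risk of inconsistent answers) onto the person assembling the bricks.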
The Data Warehouse and the Data Lake can Co-exist
In light of these new techniques and the masses of data that are now available, I believe Kimball still has relevance and still has a place in modern business. Business users will still rely more heavily on structured corporate data to make day-to-day decisions. I believe the best decisions in the future will be those made where the users have the most information possible. That information is not always available from within corporate systems, so the opportunity to combine data from multiple sources, and to add context to the data warehouse from the data lake, will benefit the modern enterprise. But as with all things, this should only be done where there is a genuine business need.