On this blog, I showcase a lot of different techniques for manipulating and reshaping data. For anyone who follows the blog, you already know this, and you know it's a pretty important topic to me. But the thing we shouldn't lose sight of is WHY we do this. It's to drive analytics. I'm fairly convinced that the majority of the loyal readers here already know this. Thus, I wanted to ask your opinion on something...
How do you design your data model?
What I'm specifically interested in is how you approach designing the Fact and Dimension tables you use for your Power Pivot model. And I'm not specifically talking about Power Query here. We all know you should be using what you learned from our recently relaunched Power Query Academy to do the technical parts. 😉
What I'm more interested in is the thought process you go through before you get to the technical bit of doing the data reshaping.
If you read books on setting up a data model, you'll probably be told that you need to do the following four steps:
- Identify the business process
- Determine the grain of the model
- Design your Dimension tables
- Design the Fact tables
So if you're asked "how do you design your data model", do these steps resonate with you, and why?
Do you consciously sit down, and work through each of these steps in order? I suspect that many self-service BI analysts skip the first step entirely as they are implicitly familiar with their business process. (As a consultant, I ask a lot of questions in this area to try and understand this before building anything.)
Do you design the reports on paper, then work backwards to the data you'll need, go find it and reshape it? Or do you go the other way, trying to collect and reshape the data, then build reports once you think you have what you need?
Do you explicitly define the model grain? And if you do, what does that mean to you? Is it restricted to "I want transactions at a monthly/daily/hourly basis"? Or do you go deeper, like "I want transactions at a daily basis and want to break them down by customer, region and product"?
Why the question?
There are actually two reasons why I'm asking this question:
Reason 1 is that I think healthy discussion makes all of us better. I'd like to hear your thoughts on this as I'm probably going to learn something that I haven't discovered in my own learning journey.
Reason 2 is that my whole business is around teaching people how to do these things, and I'm always looking to make things clearer. The more opinions I hear (even if they contrast with each other), the more I can help people understand this topic.
So sound off, please! We'd all love to hear how you approach the task of building a data model.
8 thoughts on “How do You Design Your Data Model?”
I'm going to be honest, I rarely build data models. Almost all of the Power Query work I do (and quite a few others in my company are in the same boat) is taking data in one format and putting it into a different format that a different system/customer/business unit/etc. needs in order to work with it. Think: a customer sends us a report, and it needs to have different column headers in a different order to upload into our system.
When I do build data models, they're rarely complicated enough to need Dimension/Fact tables; usually there's one, maybe two tables that I'm analyzing, and that's that. I'm thinking of situations where we have weekly CSVs that need to get rolled up into a total count per month or business unit, that kind of thing.
On the very rare occasions I need a full data model like this, I do sit down and think through the process and the granularity, but I've never sketched it out first as it's usually easier for me to sketch it out in Power BI, as it were.
Hi Ken! How are you? My approach is the following.
Date table built from a template created with PQ. DimProduct, DimStore, DimCustomer, etc. are usually generated by combining staging queries of the Fact tables (Sales, Inventories, Purchases, etc.) that contain enough columns for generating dimension tables, though I probably won't go with this approach if the datasets contain millions of rows. I might create queries or files (like a dataflow) in advance if a Product master is not available.
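To make that pattern concrete: the dimension is just the distinct key columns pulled out of a fact (staging) query. The comment describes doing this with Power Query staging queries; purely as an illustration, here is the same idea sketched as a DAX calculated table, with hypothetical table and column names:

```
-- Minimal sketch (not the commenter's actual queries): derive a small
-- dimension by de-duplicating key columns from an assumed fact table.
DimProduct =
DISTINCT (
    SELECTCOLUMNS (
        FactSales,                              -- hypothetical fact/staging table
        "ProductID", FactSales[ProductID],      -- business key
        "ProductName", FactSales[ProductName],
        "Category", FactSales[Category]
    )
)
```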
If the Fact tables include assets such as Raw Materials, WIP, and Parts, whose columns and cardinalities are very different from Finished Goods, then tables for header normalization are needed before ingesting the datasets into the Data Model. If the business requirement is to be able to drill down to the bottom of the line for any kind of asset (FG, RM, WIP, Parts), then header names must be normalized for all of these assets before combining them together. But if a high-level summary is enough, one might instead create a calculated table in Power BI that stores only the necessary information (Date, Business Entity, Asset, Inventory Quantity, Inventory Amount, Sales Quantity, Sales Amount, Sales Cost, etc.) by using DAX with variables that store temporary tables of these assets before combining them all together with UNION.
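A minimal sketch of that UNION approach, assuming hypothetical fact tables FactFG and FactRM and made-up column names (each variable projects one asset type onto a common column layout before the tables are stacked):

```
-- Hypothetical sketch of the calculated-table approach described above.
-- Table and column names are assumptions, not a real model.
Inventory Summary =
VAR FG =
    SELECTCOLUMNS (
        FactFG,
        "Date", FactFG[Date],
        "Asset", "Finished Goods",
        "Inventory Quantity", FactFG[InventoryQty],
        "Inventory Amount", FactFG[InventoryAmt]
    )
VAR RM =
    SELECTCOLUMNS (
        FactRM,
        "Date", FactRM[Date],
        "Asset", "Raw Materials",
        "Inventory Quantity", FactRM[InventoryQty],
        "Inventory Amount", FactRM[InventoryAmt]
    )
RETURN
    UNION ( FG, RM )   -- stack the assets into one summary table
```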
After all, it's the business requirement that matters most in how you design your model, and one should pay enough attention to the efficiency of the data refresh.
I start with the question to be answered, hypothesis to be tested or business process/results to monitor.
Then the grain.
Next, I explore the data available (Databases, APIs), which is generally fairly straightforward in the domain I specialize in (car dealers).
I will then determine the Dimension and Fact tables required, pull them into Power Pivot, and create exploratory Pivot Tables. Performance issues related to Data Refresh help me determine the joins and where the processing should happen, i.e. with custom SQL queries or by letting Excel do the crunching. I usually mock up the report(s) in this step and finalize in the last step.
Next, report presentation constraints (SharePoint or PowerBI workbook size etc) will sometimes require some modifications in the data model.
Last, I finalize the report and presentation details.
Thanks guys, and definitely still interested in others experiences. 🙂
I generally have the person requesting the report mock it up in Excel to show how they would like to see it. Not a working report, but it makes them think through the rows, the columns, how the report pages would look, etc. What kind of charts, or in the case of tables, what kind of rows and columns, and how it would drill down/expand/collapse, etc.
Once they spend some time on that, with my input and help if they like, I then have a much better idea of how the model should look, including granularity.
Hi Ken,
My approach is a kind of reverse engineering.
First, I will start from the reports users want to see.
My experience tells me that (most) users have no idea about the latest Excel or BI tools. They are very much used to "traditional" Excel reports. They would show me: I want this and that in this and that format (of table). I often have to state: "Don't mind the format of the output at this stage; tell me what dimensions you need, what time intervals you would be interested in, how you slice and dice, and what is the purpose of this and that." We all know that a lot of numbers on a "traditional" report are there just for "reference". They go "unread" most of the time.
Second, I will ask the questions: How are you doing your existing reports currently? Don't tell me, show me. This is how I understand where the data is coming from, and what level of "manual" transformation is required (most of which can be done by Power Query, of course).
Third, I will ask the questions: How often does the report need to be updated? How much time is expected to get the data required? Of course you may ask... why don't we connect to the data source directly? Haha... in the real world it is, for whatever reasons, not that easy.
Finally, get the data and take action!... And be prepared for bug-finding and fine-tuning.
In my experience, building the model is not the most time-consuming part. Talking to the report users is. (In many cases, the report requester has no idea what he/she wants. He/she just says: "I want a Dashboard for this and that...")
Cheers,
MF
What a great question to your readers!
As your readers point out, the approach varies depending on who you are and what you need to do. For me, I work in a large organization and my data comes from ERP and OLTP systems that dictate the form in which I will receive my data. I don't have expertise in all the sources so I have to work with IT to design queries that have the data I think I need and the fields on which I can build relationships. It may not be a perfect data model form but as long as it suffices I'm OK.
Designing fact and dimension tables now comes down to query performance. It would be nice to have lots of elegant dim tables with only a few columns, but if they're coming from multiple data sources and I need crosswalks between (say) locations because source A uses GLN and source B uses tax parcel, my lookup tables start to balloon. And the fact tables are enormous - millions of rows per year of data with 40+ columns of relevant data for filtering. IT doesn't mirror the data, nor can it run the risk of runaway queries on the OLTP system, so I get downloads of .csv files rather than getting SQL directly against production. Queries run on local machines so self-joins take almost an hour on individual fact tables - better to pull in the raw columns and use DAX for the references.
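To picture that last point: one way to handle the references in DAX rather than with a self-join in the extract is an in-model lookup as a calculated column. A minimal sketch, assuming a hypothetical FactOrders table where each row may point back at an earlier order:

```
-- Hypothetical sketch: resolve a reference to another row of the same
-- fact table inside the model instead of self-joining in the source query.
-- FactOrders and its columns are assumptions for illustration only.
Original Order Date =
LOOKUPVALUE (
    FactOrders[OrderDate],            -- value to return
    FactOrders[OrderID],              -- column to search
    FactOrders[OriginalOrderID]       -- search value from the current row
)
```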
At this point you're probably scenting a consulting opportunity to streamline these activities, and you wouldn't be wrong, but our organization does not have the bandwidth to build a proper Power BI infrastructure. Nor do we know enough to lock one into place - there are so many dark corners where analysis is needed that we spend our time on finding and answering questions rather than creating infrastructure. Short-sighted perhaps, but again it's a result of limited bandwidth.
Initially modelling to a business process is secondary to getting usable data. If I can get the data relevant to a process then I will worry about tuning the data model later - the primary question when dealing with multiple sources is verifying that the numbers work. Calibration is critical - in my model, do my filters and measures give the same results as long-standing SQL queries or the standard ERP screens? If not, back to IT to reconcile the errors, assuming my DAX is correct.
Now I have calibrated data and I can start to ask questions relevant to the business process. But, one question begets another, and if you do it right pretty soon what you thought you needed to know has morphed into a lot of deeper questions you need to ask. Not many business processes survive this kind of analysis. You can certainly run reports for the current process, but you've already realized that your process isn't doing what you thought it did.
More data, more calibration, more questions... and that's why we love the Power family of tools.
When I design a model, I try to answer the why, what, and who first; only then will I tackle the how.
Why: why do we need this report? Is there something else of a similar nature in the business that I can leverage? I.e., instead of building a whole new model, I can just improve the current model to make it work better.
What: what are the key drivers of the data (cost drivers, if it's a financial report)? This in turn decides the grain. The more drivers you have, the more dynamic your model needs to be in order to handle multiple "what-if" scenarios.
Who: who are the main users of my report? E.g., if they are big shot managers who rarely have time to spend more than 30 seconds reading any report, then high-level reporting and Power BI visualizations it is. Another who question to ponder is: who is going to cover for me when I go on leave? The model needs to be elegant and efficient, yet at the same time easy to understand if my team mates are not that strong on the BI front.
Normally, after I'm done with the why, what, and who, the how is 70% clear.