Ken is at the PASS BA conference this week, so it seemed like a perfect time for me to publish my first Power Query post here. In this installment, I'm going to show how to map columns between data sets.
Exploring how to map columns between data sets
I’m on a great new project where Power Query is the bread and butter of the solution. We’re pulling design information from Engineering Design Systems and building transforms to load into another engineering application. This application has a very strict requirement for data layout. Needless to say, the data structures and field names are seldom consistent between the two applications, so a key part of the transform is to map columns from one data set to the other.
Generally, Power Query’s ability to insert, rename, and move columns is useful in a case like this, however we are doing this for a large number of different data transfers, developing the maps in an ongoing process, and I REALLY don’t want to rewrite the Power Query steps for every change in every transform. (Also being able to document the transform is important for design and debugging).
Here’s a simplified example:
Source Data: The Flintstones
The Flintstones Sample data in a small table provides the source of data to be mapped
Target Data Structure: M*A*S*H
What does M*A*S*H have to do with the Flintstones? Not much really, that’s the point. But I want to convert data to this layout.
The "Map Table": The key piece needed to map columns between data sets
I set a goal to write a power query transform that was agnostic of specific column names and field counts, and so would not use Table.AddColumn, Table.RenameColumns, Table.RemoveColumns, or Table.ReorderColumns operations.
Mapping Strategy/Assumptions:
- Data will not necessarily end up in the same column order between Source and Target.
- Not all the columns map from Source to Target.
- Not all the Target columns will be filled from the Source.
The solution: A "Map" table on an Excel sheet; A simple list of Source field names and Target Field names (I like using a column format for readability).
The Transforms
The "TblMap" Transform
The query reads the "Map" table data and flips it around so that the Source names are the table’s column names:
The complete M code used for this solution is shown here:
The "Output" Transform
This query references the tblMap transform and appends the original source data, giving something like this:
Now just promote the first row to Headers, overwriting the existing column names, and the new Target data structure is in place:
Dealing with un-mapped columns
But what about those pesky un-mapped columns (Column7 and Column8)? Normally I would use Table.RemoveColumns. I don't want to do that here, though, as this would hard code column names into the M code that might not exist next time, resulting in errors.
Instead, we just transpose the table and filter out any columns that begin with “Column”, and transpose it back. The complete M code for the query is shown here:
And here is the output in Excel once we load it to a table:
Closing Thoughts
So there you go. One of the best things I like about this approach is how flexibly it can be modified. Spell “Klinger” wrong? Just modify the spelling in the Map table. Forget to add Rizzo or Nurse Able to the Target? Just add them to the table on the Target side and they are in the result. Forgot to include Dino in the Source data? Just add him to the list.
The sample file is attached. Give it a try. Hope it can be useful.
A thought on Data Types
I have not done a lot of testing with data types on this approach. My work will not do any math on the contents until after the re-mapping (I hope), so data typing can be done at the end. If there is any math to be done in the middle of the process, you would need to be careful not to have power Query treat numbers as integers (this has bitten me before).
Performance
The last step where extra column names are removed uses a transpose which could be really slow for long data sets. Another solution that could fix this would be to create a list from the Map table to automate a RemoveOtherColumns function.
5 thoughts on “Map Columns Between Data Sets”
Great solution, thanks for sharing
There is something very satisfying about solving these sorts of problems
Hi Alex 🙂
Nice to see a person with polish last name (just like me).
I would like to propose another approach to the problem (without transpose and other operations). I think, this should be fast and efficient for big tables (i did not try)
let
Source_tblMAP = Excel.CurrentWorkbook(){[Name="tblMAP"]}[Content],
ReorderColumns_tblMAP = Table.ReorderColumns(Source_tblMAP,{"SOURCE", "TARGET"}),
OldNewColumnNames = Table.ToRows(Table.SelectRows(ReorderColumns_tblMAP, each ([SOURCE] < > null))),
SourceData1 = Excel.CurrentWorkbook(){[Name="tblSource"]}[Content],
RightSourceDataTable = Table.SelectColumns(SourceData1, List.RemoveNulls(Source_tblMAP[SOURCE])),
NewHeaders_SourceData = Table.RenameColumns(RightSourceDataTable,OldNewColumnNames),
ResultTbl = Table.Combine({Table.FromRows({},ReorderColumns_tblMAP[TARGET]), NewHeaders_SourceData})
in
ResultTbl
Regards 🙂
Bill, thanks for your response. (Maybe we should ask them to rename it "Polish Query"?). Your approach is definitely more sophisticated, and elegant too. Think it might make a good function??
I'm particularly happy to see how you use the list components in the solution - this is an area where I need to improve my knowledge.
So I decided to build Bill's query as a function. I also added a filter to remove any nulls from the [TARGET] column of tblMAP, which cause failures. Here it is:
(SourceData1,Source_tblMAP)=>
let
ReorderColumns_tblMAP = Table.ReorderColumns(Source_tblMAP,{"SOURCE", "TARGET"}),
#"Filtered Rows" = Table.SelectRows(ReorderColumns_tblMAP, each ([TARGET] < > null)),
OldNewColumnNames = Table.ToRows(Table.SelectRows(#"Filtered Rows", each ([SOURCE] < > null))),
RightSourceDataTable = Table.SelectColumns(SourceData1, List.RemoveNulls(Source_tblMAP[SOURCE])),
NewHeaders_SourceData = Table.RenameColumns(RightSourceDataTable,OldNewColumnNames),
ResultTbl = Table.Combine({Table.FromRows({},#"Filtered Rows"[TARGET]), NewHeaders_SourceData})
in
ResultTbl
Check out DataChant's posting on using List.Zip to do this!
http://datachant.com/2017/01/26/power-bi-pitfall-6/