Keep rows that begin with alphabetical characters

Even though this hits on techniques used on this blog before, a colleague asked today “I have a lot of garbage names in a column and only want to keep rows that begin with an alphabetical character.”  I figured we’d explore that here.

The issue

The data at hand has a bunch of garbage records, as well as some that are valid.  Anything that starts with a number, punctuation or other special character needs to go, leaving only items that start with alphabetical characters.

So basically, we want to take the table in blue, and convert it to the table shown in green, below:

image

Click here to download the workbook with both the raw data as well as the finished solution.

Getting started

Naturally, the first thing we need to do is bring the data into Power Query:

  • Select a cell in the data range
  • Create a new query –> From Table

Now, with the data in the table, our first temptation is to immediately try to come up with a way to figure out how to filter to only alpha characters.  Unfortunately, there are 26 of them, and a custom filter will only allow 2 at a time:

image

Okay, that won’t work, so we lean to custom columns…

We could write a function to do it that looks like this:

=if Text.StartWith([Name],”A”) then [Name] else if Text.StartsWith([Name],”B” then [Name] else… “Remove”

Of course, we would replace the … with a whole bunch more nested “if” statements to do the rest of the alphabet.  Very slow and tedious though.

Or we could fall back on the technique shown in my post on Separating Values & Text.  So let’s do just that.

Keep rows that begin with alphabetical characters

To get started, let’s do the following:

  • Add a custom column
  • Use the following formula:

=Text.SplitAny(Text.Lower([Name]),”abcdefghijklmnopqrstuvwxyz”){0}

Based on the concepts covered in the previous post, this formula takes the Name column, converts it to lower case, then splits the text at any occurrence of the letters provided between the quotes.  This actually returns a list, so we use {0} to drill into the first instance.  The trick here is that if the text starts with a letter, it splits and results in a blank record.  If it’s not alphabetical, however, it gives us the character(s) that aren’t alphabetical:

image

So using that, we can wrap the function in an “if” statement instead.

  • Click the gear icon next to the Added Custom step
  • Modify the formula as follows

= if Text.SplitAny(Text.Lower([Name]),”abcdefghijklmnopqrstuvwxyz”){0} then “Remove” else null

This results in a column that helpfully tells us what we can remove:

image

So we can now simply:

  • Filter the Custom column to only keep null values (uncheck “Remove”)
  • Remove the Custom column

And voila!  We have now restricted our list to only show names that start with alphabetical characters.

image

Share:

Facebook
Twitter
LinkedIn

8 thoughts on “Keep rows that begin with alphabetical characters

  1. In the days before tables, I would have used an Advanced Filter:

    =OR(AND(CODE(LEFT(A2))>=65,CODE(LEFT(A2))=97,CODE(LEFT(A2))=65,CODE(LEFT(Table1[@Names]))=97,CODE(LEFT(Table1[@Names]))<=122))

  2. For sure, Jason. Definitely a viable solution if the data is already in a table in Excel.

    Of course, if the data is coming from the web or a database, you'd have to land it to an Excel table first. With Power Query this will let us remove those records without ever actually landing them in the Excel table first. 🙂

  3. Hi Wyn,

    What is actually happening is that the text is being split at every instance of the characters in the string we provided, losing the character we split by in the process. So when the first character is alphabetical, it actually leaves a blank.

    To see this in a bit more detail, pull in the raw data table, then use the following formula:
    =Text.SplitAny(Text.Lower([Name]),”abcdefghijklmnopqrstuvwxyz”)

    You'll get a list for each item that you can drill in to. Try stepping back to the formula providing a character string of "j" instead of the full "abcde..." You can flip back and forth adding letters to see how it affects the output. The {0} at the end of the original formula from the post is simply to pull the first value from that list.

  4. Hi Ken 🙂
    Thanks for explanations.
    This below is only for joke 🙂
    One step query:
    let
    Source = Table.FromColumns({List.RemoveNulls(List.Transform(Excel.CurrentWorkbook(){[Name="Table1"]}[Content][Names], each if List.IsDistinct({Text.At(_,0), "A".."Z"}) then null else _))}, {"Names"})
    in
    Source

    Regards

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Latest Posts