Python Using Regex on Entire Excel File

September 30, 2018

Python Using Regex on Entire Excel File

Following up on my previous post that used VBA in Excel to pull out Regex matches from a spreadsheet, I wanted to demonstrate that Python can accomplish basically the same goal. Python needs a little help to work with Excel files in the form of openpyxl and pandas modules that will need to be installed for this to work.

Here are the steps I took to pull out all the numbers with 9 to 14 characters from an Excel file and put them in the clipboard:

Easy File Explorer Popup

We need an easy way of telling Python where our spreadsheet is saved at. With tkinter we can have a windows file explorer popup so the user can browse to their excel file they want to pull data out of.

from tkinter import Tk
from tkinter.filedialog import askopenfilename
def get_excel_path():
    Tk().withdraw()
    return askopenfilename(filetypes = [('Excel Files','*.csv;*.xlsx;*.xlsm;*.xltx;*.xltm')],
           title = 'Select your Excel File and Click Open')

Convert Excel File Into Pandas Data-frame

The pandas module does double duty in allowing us to upload csv files and ensure our outputs are useful as a data-frame object. The openpyxl module gives us the ability to load Excel specific files and also outputs them as pandas data-frame objects. We just have to check the path given to know the file type in order to make sure we use the right function, pandas.read_csv or openpyxl.load_workbook.

import os
import pandas as pd
from openpyxl import load_workbook
def upload_spreadsheet(path, active_sheet_only = True):
    '''Returns pandas dataframe. Returns empty dataframe if fails'''
    #check if file exists
    if not os.path.isfile(path):
        return pd.DataFrame()

    #check file type
    file_type = os.path.splitext(path)[1]

    if file_type == '.csv':
        df = pd.read_csv(path)
    elif file_type in ['.xlsx','.xlsm','.xltx','.xltm']:
        wb = load_workbook(path)
        #convert sheets to pandas dataframe
        if active_sheet_only:
            df = pd.DataFrame(wb.active.values)
        else:
            #combine all sheets into one dataframe
            frames = [pd.DataFrame(sheet.values) for sheet in wb.worksheets]
            df = pd.concat(frames)
  wb.close()
    else:
        return pd.Dataframe()

    return df

Combine Our Data-frame Into String

Once we have our data-frame we can easily combine it into one string with a list comprehension so we can run our Regex against it just one time.

def combine_dataframe(dataframe):
    '''Converts pandas dataframe into one list'''
    return [cell for cell in [dataframe[i] for i in dataframe]]

Pull Out Regex Matches from String

With re.findall, the only complication is when it returns a list of tuples sometimes, depending on the regex you use. We have to check the list items for type tuple and pull out just the matches (the first items in the tuples) in order to get a usable list of matches to return.

import re
def text_search(text, regex, dedupe=True):
    re_matches = re.findall(regex, str(text))
    if re_matches:
        #remove all but first match if tuples
        re_matches = [y[0] if type(y) == tuple else y for y in re_matches]
        if dedupe:
            return list(set(re_matches))
        return list(re_matches)
    return []

Copy String to Clipboard

import subprocess
def copy_text(text):
subprocess.run(['clip.exe'], input = text.strip().encode('ascii', 'ignore'), check = True)

Put It All Together

Now we have all the functions we need to upload our Excel spreadsheet and pull out all the numbers with 9 to 14 characters.

def main():
    file_path = get_excel_path()
    if file_path == '':return
    df = upload_spreadsheet(file_path, active_sheet_only = False)
    #convert to a list then join into string
    text = ' '.join([str(x) for x in combine_dataframe(df)])
    pattern = r'([0-9]{9,14})'
    matches = text_search(text, pattern)
    copy_text('\n'.join(matches))
    print(len(matches), 'matches added to your clipboard.')

if __name__ == '__main__':
    main()

The full Python code can be found on my github:

https://github.com/eddiethedean/Excel-Python

Comments

sangeetha sathyanFebruary 4, 2019 at 11:13 PM
Excellent post. I learned a lot from this blog and I suggest my friends to visit your blog to learn new concept about technology.
ccna course in Chennai
Python Training in Chennai
R Programming Training in Chennai
AWS Training in Chennai
DevOps Training in Chennai
Angularjs Training in Chennai
RPA Training in Chennai
Blue Prism Training in Chennai
Data Science Course in Chennai
ReplyDelete
Replies
KayalJune 4, 2019 at 10:18 PM
Thank you for providing the creativity post with comprehensive content. This is very helpful for gain my skill knowledge, Kindly updating...
Linux Training in Chennai
Learn Linux
Spark Training in Chennai
Oracle Training in Chennai
Unix Training in Chennai
Oracle DBA Training in Chennai
Tableau Training in Chennai
Power BI Training in Chennai
Embedded System Course Chennai
ReplyDelete
Replies
AnonymousDecember 24, 2021 at 6:27 AM
Great job for publishing such a nice article. Your article isn’t only useful but it is additionally really informative. Read more info about Hni clients database. Thank you because you have been willing to share information with us.
ReplyDelete
Replies
AnonymousFebruary 6, 2022 at 8:06 PM
The best casino games for a mobile device - Aprcasino
Best online 카지노 사이트 유니88 casino games for a mobile device – Discover a list of online casino games, play slots, and more and try them for free!
ReplyDelete
Replies
AnonymousSeptember 8, 2023 at 10:15 PM
I really appreciate your efforts for writing this blog.It was reaaly interesting and helpful.
Python Classes in Nagpur
ReplyDelete
Replies

Add comment

Search This Blog

Odos Matthews Programming Blog

Python Using Regex on Entire Excel File

Easy File Explorer Popup

Convert Excel File Into Pandas Data-frame

Combine Our Data-frame Into String

Pull Out Regex Matches from String

Copy String to Clipboard

Put It All Together

Comments

Post a Comment

Popular Posts

How to Make a Python Class Immutable