Python Using Regex on Entire Excel File

Following up on my previous post that used VBA in Excel to pull out Regex matches from a spreadsheet, I wanted to demonstrate that Python can accomplish basically the same goal. Python needs a little help to work with Excel files in the form of openpyxl and pandas modules that will need to be installed for this to work.

Here are the steps I took to pull out all the numbers with 9 to 14 characters from an Excel file and put them in the clipboard:


Easy File Explorer Popup

We need an easy way of telling Python where our spreadsheet is saved at. With tkinter we can have a windows file explorer popup so the user can browse to their excel file they want to pull data out of.

from tkinter import Tk
from tkinter.filedialog import askopenfilename
def get_excel_path():
    Tk().withdraw()
    return askopenfilename(filetypes = [('Excel Files','*.csv;*.xlsx;*.xlsm;*.xltx;*.xltm')],
           title = 'Select your Excel File and Click Open')
      

Convert Excel File Into Pandas Data-frame

The pandas module does double duty in allowing us to upload csv files and ensure our outputs are useful as a data-frame object. The openpyxl module gives us the ability to load Excel specific files and also outputs them as pandas data-frame objects. We just have to check the path given to know the file type in order to make sure we use the right function, pandas.read_csv or openpyxl.load_workbook. 
    
import os
import pandas as pd
from openpyxl import load_workbook
def upload_spreadsheet(path, active_sheet_only = True):
    '''Returns pandas dataframe. Returns empty dataframe if fails'''
    #check if file exists
    if not os.path.isfile(path):
        return pd.DataFrame()
       
    #check file type
    file_type = os.path.splitext(path)[1]
   
    if file_type == '.csv':
        df = pd.read_csv(path)
    elif file_type in ['.xlsx','.xlsm','.xltx','.xltm']:
        wb = load_workbook(path)
        #convert sheets to pandas dataframe
        if active_sheet_only:
            df = pd.DataFrame(wb.active.values)
        else:
            #combine all sheets into one dataframe
            frames = [pd.DataFrame(sheet.values) for sheet in wb.worksheets]
            df = pd.concat(frames)

        wb.close()
    else:
        return pd.Dataframe()
  
    return df
     

Combine Our Data-frame Into String

Once we have our data-frame we can easily combine it into one string with a list comprehension so we can run our Regex against it just one time.

def combine_dataframe(dataframe):
    '''Converts pandas dataframe into one list'''
    return [cell for cell in [dataframe[i] for i in dataframe]]
   

Pull Out Regex Matches from String

With re.findall, the only complication is when it returns a list of tuples sometimes, depending on the regex you use. We have to check the list items for type tuple and pull out just the matches (the first items in the tuples) in order to get a usable list of matches to return.

import re
def text_search(text, regex, dedupe=True):
    re_matches = re.findall(regex, str(text))
    if re_matches:
        #remove all but first match if tuples
        re_matches = [y[0] if type(y) == tuple else y for y in re_matches]
        if dedupe:
            return list(set(re_matches))
        return list(re_matches)
    return []
   

Copy String to Clipboard

import subprocess
def copy_text(text):
    subprocess.run(['clip.exe'], input = text.strip().encode('ascii', 'ignore'), check = True)
    

Put It All Together

Now we have all the functions we need to upload our Excel spreadsheet and pull out all the numbers with 9 to 14 characters.
   
def main():
    file_path = get_excel_path()
    if file_path == '':return
    df = upload_spreadsheet(file_path, active_sheet_only = False)
    #convert to a list then join into string
    text = ' '.join([str(x) for x in combine_dataframe(df)])
    pattern = r'([0-9]{9,14})'
    matches = text_search(text, pattern)
    copy_text('\n'.join(matches))
    print(len(matches), 'matches added to your clipboard.')
   
if __name__ == '__main__':
    main()



The full Python code can be found on my github:

https://github.com/eddiethedean/Excel-Python

Comments

  1. Replies
    1. Regular expressions (regex) aren't ideal for processing entire Excel files. They're better suited for manipulating specific text within the file. However, Python offers libraries for working with Excel data effectively. Here's how to approach this:

      Libraries:

      python projects for engineering students
      openpyxl: This library allows you to read and write data in Excel files (xlsx format).
      pandas: This library provides powerful data manipulation tools for analyzing and processing data from Excel files.
      Approach:

      Read the Excel file: Use openpyxl.load_workbook() to open the file.
      Iterate through sheets (optional):
      If you need to process data from all sheets, use a loop to iterate through each sheet in the workbook (workbook.sheetnames).
      Process data:

      Big Data Projects For Final Year Students
      Use pandas' pd.read_excel() function to read the data into a pandas DataFrame. This allows you to use various methods like filtering, grouping, and applying regex operations on specific columns.

      Machine Learning Projects for Final Year

      Delete
  2. Great job for publishing such a nice article. Your article isn’t only useful but it is additionally really informative. Read more info about Hni clients database. Thank you because you have been willing to share information with us.

    ReplyDelete
  3. The best casino games for a mobile device - Aprcasino
    Best online 카지노 사이트 유니88 casino games for a mobile device – Discover a list of online casino games, play slots, and more and try them for free!

    ReplyDelete
  4. I really appreciate your efforts for writing this blog.It was reaaly interesting and helpful.
    Python Classes in Nagpur

    ReplyDelete

Post a Comment

Popular Posts