Tuesday, 5 May 2020

Bulk extract text from Google Docs for analysis

The following Google Apps Script is designed to go through a number of Google Docs and extract specific text from the body that represents an answer to a question. The tool was developed as a result of a Researcher needing to analyse hundreds of files that were each an answer to various survey questions. There were two specific sections that needed to be targeted and the content collated into a spreadsheet so further analysis could be performed.
Screenshot of Doc Analysis results in spreadsheet
Screenshot of Doc Analysis results in spreadsheet

The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.

Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();
var lookFor1 = "complete this section)?";
Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;
We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");
Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);

This is repeated for the second question of text we want to capture.

 

Download

Bulk extract text from Google Docs for analysis download (please use 'Overview' > 'Make a copy' for your own version).

 

/*
Function to get Google Docs and pass to infoGrabber.
*/
function run() {
// get Google Drive folder containing Docs to analyse
var folderId = "ENTER ID HERE";
var folder = DriveApp.getFolderById(folderId);
var files = folder.getFiles();
// get current relevant sheet and clear its contents
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheetByName('Sheet1');
sheet.clear();
// loop through each Doc in Google Drive folder
while(files.hasNext()) {
// get next file and its ID
var file = files.next();
var fileId = file.getId();
var fileUrl = file.getUrl();
var fileName = file.getName();
// run Function to extract relevant text from Doc body
var results = infoGrabber(fileId);
// extract sections of text from returned data
var section1 = results[0];
var section2 = results[1];
// push extra items into the array and append to the spreadsheet
results.push(fileUrl);
results.push(fileName);
sheet.appendRow(results);
}
}
/*
Function to extract Google Doc body and filter out relevant data.
*/
function infoGrabber(id) {
// create empty variables for later pushing relevant text into
var section1 = "";
var section2 = "";
try{
// open the Doc and get its body
var doc = DocumentApp.openById(id);
var body = doc.getBody();
// get the text from the Doc
var str = body.getText().toString();
// create a variable containing unique string that comes just before the result we want to extract
var lookFor1 = "complete this section)?";
// look for the previous unique string + its length, so we can start at the next character (the beginning of the result we want)
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;
// create an end variable that is the start of the next chunk of unwanted text
var endOffset1 = str.indexOf("EVERYTHING BELOW");
// get all the text using the above start and end parameters
section1 = str.substring( startOffset1, endOffset1);
// use a Regex to remove the long underscore in the Doc from the result
section1 = section1.replace(/[_]+/,"");
// remove any whitespace surrounding the result text
section1 = section1.trim();
// **********************************************************************************
// create a variable containing unique string that comes just before the result we want to extract
var lookFor = "prior feedback:";
// look for the previous unique string + its length, so we can start at the next character (the beginning of the result we want)
var startOffset2 = str.indexOf(lookFor) + lookFor.length;
// create an end variable that is the start of the next chunk of unwanted text
var endOffset2 = str.indexOf("TUTORIAL ESSAY MARKING GUIDELINES");
// get all the text using the above start and end parameters
section2 = str.substring(startOffset2, endOffset2);
// remove any whitespace surrounding the result text
section2 = section2.trim();
// return both sections to the Parent Function
return [section1, section2];
}
catch(e){
// return error details if there is a problem
return ([e, e.stack]);
}
}
// Function to create menu item.
function onOpen() {
SpreadsheetApp.getUi()
.createMenu('Custom Menu')
.addItem('Run Analysis Grabber', 'run') // label for menu item, name of function to run.
.addToUi();
}

12 comments:

  1. This is very useful, when i run this code, it is overwriting the header rows and over wirting the my search texts

    ReplyDelete
    Replies
    1. Ah yes, thank you for spotting this. I have now updated the code the leave Row 1 alone.

      Kind regards
      Phil

      Delete
  2. Hi, great work, honestly.
    One thing though, this is still erasing the first row for me.

    I was able to collect also the folderName.

    I was able to include the startOffset tag because it's only 2 charaters by using:
    section1 = str.substring( startOffset1-2, endOffset1);

    But I'm sure there is a better way to include the startOffset

    And also, how would I go about getting a column with the file url that export the file into a PDF (same fileURL basically but ending ending with /export?format=pdf)?

    Thanks

    ReplyDelete
    Replies
    1. Hi Menkashoo

      I assume you've taken a new copy of the file since my post on 5 November where I've fixed the first row erasing problem?

      To get that PDF Url one way would be via a bit of concatenation so you add '/export?format=pdf' to the Google Doc Url (Line 23).

      Kind regards
      Phil

      Delete
  3. And how would I go about reorganizing the columns in a different order?

    ReplyDelete
    Replies
    1. Hi Menkashoo

      The columns match the format of the data in the array - so you would need to be comfortable understanding how arrays work in order to try and manipulate this (as it would need changing in a few places in the code).

      A simple alternative may be to just use an ImportRange query in another Google Sheet to pull the data through and arrange as you wish: https://support.google.com/docs/answer/3093340?hl=en-GB

      Kind regards
      Phil

      Delete
  4. Thank you for this! Absolutely wonderful!

    ReplyDelete
  5. how do i get all the text from each file?

    ReplyDelete
    Replies
    1. Hi

      You might find this blog post more useful: https://www.pbainbridge.co.uk/2021/07/extract-text-from-multiple-google-docs.html

      Kind regards
      Phil

      Delete
  6. Hi, first this was very helpful second is there a way i can get the information to appear in the next column according to how many lines the text is in for example:
    "I am
    haris
    hello"
    to appear as
    I am | haris | hello
    and is there a way to get multiple data from different parts of the document.
    Thanking you in advance.

    ReplyDelete
    Replies
    1. Hi Haris

      Thank you very much.

      Unfortunately I've not done any further text analysis work with Apps Script to be able to give you a direct response to your enquiry.

      Kind regards
      Phil

      Delete