Screenshot of Doc Analysis results in spreadsheet |
The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.
Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var lookFor1 = "complete this section)?";
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);
This is repeated for the second question of text we want to capture.
Download
Bulk extract text from Google Docs for analysis download (please use 'Overview' > 'Make a copy' for your own version).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* | |
Function to get Google Docs and pass to infoGrabber. | |
*/ | |
function run() { | |
// get Google Drive folder containing Docs to analyse | |
var folderId = "ENTER ID HERE"; | |
var folder = DriveApp.getFolderById(folderId); | |
var files = folder.getFiles(); | |
// get current relevant sheet and clear its contents | |
var ss = SpreadsheetApp.getActiveSpreadsheet(); | |
var sheet = ss.getSheetByName('Sheet1'); | |
sheet.clear(); | |
// loop through each Doc in Google Drive folder | |
while(files.hasNext()) { | |
// get next file and its ID | |
var file = files.next(); | |
var fileId = file.getId(); | |
var fileUrl = file.getUrl(); | |
var fileName = file.getName(); | |
// run Function to extract relevant text from Doc body | |
var results = infoGrabber(fileId); | |
// extract sections of text from returned data | |
var section1 = results[0]; | |
var section2 = results[1]; | |
// push extra items into the array and append to the spreadsheet | |
results.push(fileUrl); | |
results.push(fileName); | |
sheet.appendRow(results); | |
} | |
} | |
/* | |
Function to extract Google Doc body and filter out relevant data. | |
*/ | |
function infoGrabber(id) { | |
// create empty variables for later pushing relevant text into | |
var section1 = ""; | |
var section2 = ""; | |
try{ | |
// open the Doc and get its body | |
var doc = DocumentApp.openById(id); | |
var body = doc.getBody(); | |
// get the text from the Doc | |
var str = body.getText().toString(); | |
// create a variable containing unique string that comes just before the result we want to extract | |
var lookFor1 = "complete this section)?"; | |
// look for the previous unique string + its length, so we can start at the next character (the beginning of the result we want) | |
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length; | |
// create an end variable that is the start of the next chunk of unwanted text | |
var endOffset1 = str.indexOf("EVERYTHING BELOW"); | |
// get all the text using the above start and end parameters | |
section1 = str.substring( startOffset1, endOffset1); | |
// use a Regex to remove the long underscore in the Doc from the result | |
section1 = section1.replace(/[_]+/,""); | |
// remove any whitespace surrounding the result text | |
section1 = section1.trim(); | |
// ********************************************************************************** | |
// create a variable containing unique string that comes just before the result we want to extract | |
var lookFor = "prior feedback:"; | |
// look for the previous unique string + its length, so we can start at the next character (the beginning of the result we want) | |
var startOffset2 = str.indexOf(lookFor) + lookFor.length; | |
// create an end variable that is the start of the next chunk of unwanted text | |
var endOffset2 = str.indexOf("TUTORIAL ESSAY MARKING GUIDELINES"); | |
// get all the text using the above start and end parameters | |
section2 = str.substring(startOffset2, endOffset2); | |
// remove any whitespace surrounding the result text | |
section2 = section2.trim(); | |
// return both sections to the Parent Function | |
return [section1, section2]; | |
} | |
catch(e){ | |
// return error details if there is a problem | |
return ([e, e.stack]); | |
} | |
} | |
// Function to create menu item. | |
function onOpen() { | |
SpreadsheetApp.getUi() | |
.createMenu('Custom Menu') | |
.addItem('Run Analysis Grabber', 'run') // label for menu item, name of function to run. | |
.addToUi(); | |
} |
This is very useful, when i run this code, it is overwriting the header rows and over wirting the my search texts
ReplyDeleteAh yes, thank you for spotting this. I have now updated the code the leave Row 1 alone.
DeleteKind regards
Phil
Hi, great work, honestly.
ReplyDeleteOne thing though, this is still erasing the first row for me.
I was able to collect also the folderName.
I was able to include the startOffset tag because it's only 2 charaters by using:
section1 = str.substring( startOffset1-2, endOffset1);
But I'm sure there is a better way to include the startOffset
And also, how would I go about getting a column with the file url that export the file into a PDF (same fileURL basically but ending ending with /export?format=pdf)?
Thanks
Hi Menkashoo
DeleteI assume you've taken a new copy of the file since my post on 5 November where I've fixed the first row erasing problem?
To get that PDF Url one way would be via a bit of concatenation so you add '/export?format=pdf' to the Google Doc Url (Line 23).
Kind regards
Phil
And how would I go about reorganizing the columns in a different order?
ReplyDeleteHi Menkashoo
DeleteThe columns match the format of the data in the array - so you would need to be comfortable understanding how arrays work in order to try and manipulate this (as it would need changing in a few places in the code).
A simple alternative may be to just use an ImportRange query in another Google Sheet to pull the data through and arrange as you wish: https://support.google.com/docs/answer/3093340?hl=en-GB
Kind regards
Phil
Thank you for this! Absolutely wonderful!
ReplyDeleteThank you Orrdan!
DeleteKind regards
Phil
how do i get all the text from each file?
ReplyDeleteHi
DeleteYou might find this blog post more useful: https://www.pbainbridge.co.uk/2021/07/extract-text-from-multiple-google-docs.html
Kind regards
Phil
Hi, first this was very helpful second is there a way i can get the information to appear in the next column according to how many lines the text is in for example:
ReplyDelete"I am
haris
hello"
to appear as
I am | haris | hello
and is there a way to get multiple data from different parts of the document.
Thanking you in advance.
Hi Haris
DeleteThank you very much.
Unfortunately I've not done any further text analysis work with Apps Script to be able to give you a direct response to your enquiry.
Kind regards
Phil