Screenshot of Doc Analysis results in spreadsheet |
The results from each Google Doc is added as a row into a Google Sheet, along with a link to the relevant file and its name. As the script loops through each file it calls a Function called infoGrabber that performs the task of extracting only the relevant text.
Once we get the Doc body as a single string we specify the piece of unique text that appears just before the answer to the question that we want:
var str = body.getText().toString();Next we create a start-value that looks for the position of the above text within the Doc body and then adds its length so it will start at the next character along - which will be the beginning of our answer text that we want to extract:
var lookFor1 = "complete this section)?";
var startOffset1 = str.indexOf(lookFor1) + lookFor1.length;We also need a finish-value which will be the start of the next chunk of unwanted text in the Doc (the next question in this example):
var endOffset1 = str.indexOf("EVERYTHING BELOW");Now we can get all of the text inbetween the above two values and capture this within our spreadsheet:
section1 = str.substring( startOffset1, endOffset1);
This is repeated for the second question of text we want to capture.
Download
Bulk extract text from Google Docs for analysis download (please use 'Overview' > 'Make a copy' for your own version).
This is very useful, when i run this code, it is overwriting the header rows and over wirting the my search texts
ReplyDeleteAh yes, thank you for spotting this. I have now updated the code the leave Row 1 alone.
DeleteKind regards
Phil
Hi, great work, honestly.
ReplyDeleteOne thing though, this is still erasing the first row for me.
I was able to collect also the folderName.
I was able to include the startOffset tag because it's only 2 charaters by using:
section1 = str.substring( startOffset1-2, endOffset1);
But I'm sure there is a better way to include the startOffset
And also, how would I go about getting a column with the file url that export the file into a PDF (same fileURL basically but ending ending with /export?format=pdf)?
Thanks
Hi Menkashoo
DeleteI assume you've taken a new copy of the file since my post on 5 November where I've fixed the first row erasing problem?
To get that PDF Url one way would be via a bit of concatenation so you add '/export?format=pdf' to the Google Doc Url (Line 23).
Kind regards
Phil
And how would I go about reorganizing the columns in a different order?
ReplyDeleteHi Menkashoo
DeleteThe columns match the format of the data in the array - so you would need to be comfortable understanding how arrays work in order to try and manipulate this (as it would need changing in a few places in the code).
A simple alternative may be to just use an ImportRange query in another Google Sheet to pull the data through and arrange as you wish: https://support.google.com/docs/answer/3093340?hl=en-GB
Kind regards
Phil
Thank you for this! Absolutely wonderful!
ReplyDeleteThank you Orrdan!
DeleteKind regards
Phil
how do i get all the text from each file?
ReplyDeleteHi
DeleteYou might find this blog post more useful: https://www.pbainbridge.co.uk/2021/07/extract-text-from-multiple-google-docs.html
Kind regards
Phil
Hi, first this was very helpful second is there a way i can get the information to appear in the next column according to how many lines the text is in for example:
ReplyDelete"I am
haris
hello"
to appear as
I am | haris | hello
and is there a way to get multiple data from different parts of the document.
Thanking you in advance.
Hi Haris
DeleteThank you very much.
Unfortunately I've not done any further text analysis work with Apps Script to be able to give you a direct response to your enquiry.
Kind regards
Phil