From Data Chaos to Clarity: Creating Your Code for Efficient Literature Review (ChatGPT-based)

César P. Soares, PhD
4 min read · May 1, 2023

In today's world, information is everywhere, and managing it all can be overwhelming. This is especially true in the scientific community, where new research is published daily. The resulting information overload makes it difficult for scientists and non-scientists alike to keep up with the latest findings.

To help solve this problem, I developed a Python script that lets you efficiently ‘read’ as many PDF articles as you want and ask specific questions about their content. Using ChatGPT, an advanced language model, the script returns answers to your questions and saves them in a structured way in a .csv file.

The inspiration for this project came from the need to understand the current state of research on climate change. While a wealth of information is available, navigating and extracting key insights can be difficult. This code can help streamline the process, allowing you to quickly and easily access important information and make informed decisions based on the latest scientific evidence.

In this article, I will provide an overview of how the code works and how it can be used to make literature reviews more efficient and effective. I will not go into too much detail about the code itself, but I will provide enough information to help you start your project.

The complete code can be accessed on my GitHub.

Import the necessary libraries: os, PyPDF2, re, openai, csv, and time. The os, re, csv, and time modules ship with Python; PyPDF2 and openai need to be installed separately (e.g. with pip).

import os
import PyPDF2
import re
import openai
import csv
import time

Define a function extract_sections() that takes a PDF file as input and extracts the introduction and discussion sections. The function opens the PDF with PyPDF2, extracts the text from every page, and converts it to lowercase. It then uses regular expressions to locate the first occurrences of the words 'introduction' and 'discussion'. If both are found, the introduction is taken as the text between the two matches and the discussion as the text after the second match, with each section truncated to its first 1,500 whitespace-separated words (a rough proxy for tokens). If either section is missing, the function returns the string 'DidntFind' instead.

# Define a function to extract the introduction and discussion sections from a PDF file
def extract_sections(pdf_file):
    # Open the PDF file and extract the text from every page
    with open(pdf_file, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        text = ''
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Convert all the text to lowercase
    text = text.lower()

    # Use regular expressions to find the introduction and discussion sections
    intro_match = re.search(r'introduction', text)
    disc_match = re.search(r'discussion', text)

    # Signal to the caller if either section is missing
    if not (intro_match and disc_match):
        return 'DidntFind'

    # Slice out the two sections
    intro_text = text[intro_match.end():disc_match.start()]
    disc_text = text[disc_match.end():]

    # Keep only the first 1,500 whitespace-separated words of each section
    intro_tokens = intro_text.split()
    disc_tokens = disc_text.split()
    if len(intro_tokens) > 1500:
        intro_text = ' '.join(intro_tokens[:1500])
    if len(disc_tokens) > 1500:
        disc_text = ' '.join(disc_tokens[:1500])

    return intro_text, disc_text
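To see what the section-splitting logic does, here is a quick sanity check on a made-up string, no PDF required. The sample text below is invented purely for illustration:

```python
import re

# Invented text standing in for the lowercased output of a PDF
sample = (
    "title of a hypothetical paper introduction climate change alters "
    "rainfall patterns discussion these shifts have consequences for agriculture"
)

# Same search logic as extract_sections()
intro_match = re.search(r'introduction', sample)
disc_match = re.search(r'discussion', sample)

# Introduction = text between the two matches; discussion = text after the second
intro_text = sample[intro_match.end():disc_match.start()].strip()
disc_text = sample[disc_match.end():].strip()

print(intro_text)  # -> climate change alters rainfall patterns
print(disc_text)   # -> these shifts have consequences for agriculture
```

Note that re.search() returns the first occurrence only, so a paper that uses these words outside the section headings (e.g. in a table of contents) may split in the wrong place.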

Define a function ask_questions() that takes a string of text as input and uses OpenAI's ChatGPT to answer a predefined list of questions. For each question, the function concatenates the question with the input text and sends it to ChatGPT as the user message, with "Climate change specialist" as the system role. It uses a temperature of 0.1 so the responses stay focused rather than diverse. Each answer is appended to a list, and the function waits one second between requests to stay under the API's rate limits. Finally, it returns the list of responses.

# Define a function to ask OpenAI the questions and parse the responses
def ask_questions(text):
    results = []
    for question in questions:
        # Combine the question with the extracted section text
        prompt = question + "\n" + text
        messages = [
            {"role": "system", "content": "Climate change specialist"},
            {"role": "user", "content": prompt},
        ]
        # Note: this uses the pre-1.0 interface of the openai library
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0.1,
        )
        answer = response.choices[0].message.content
        results.append(answer)
        # Pause briefly between requests to stay under rate limits
        time.sleep(1)
    return results
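The fixed one-second pause keeps request volume down, but on large batches the API can still return transient errors such as rate limits. A common pattern is a small retry helper with exponential backoff. The sketch below is illustrative only: the with_retries helper and the flaky stub are not part of the original script.

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0):
    """Call `call()`, retrying with exponential backoff on any exception."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demonstration with a stub that fails twice before succeeding
attempts = {'n': 0}
def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise RuntimeError('rate limited')
    return 'answer'

print(with_retries(flaky, base_delay=0.1))  # -> answer
```

In the real script, the API call inside ask_questions() could be wrapped as `with_retries(lambda: openai.ChatCompletion.create(...))` so a single rate-limit error does not abort the whole run.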

Set the path to the folder containing the PDF articles in the pdf_folder variable.

# Set the path to the folder containing the PDF articles
pdf_folder = 'path'

Set up your OpenAI API credentials by assigning your API key to the openai.api_key variable.

# Set up OpenAI API credentials
openai.api_key = "Your API Key"

Define a list of questions to ask in the questions variable.

# Set up the questions to ask
questions = [
    "What are the impacts of climate change?",
    "other questions",
]

Open a CSV file for writing and write the header row with the column names 'File', 'Question', and 'Answer'. Loop through all the PDF files in the folder specified in pdf_folder. If a file does not have a .pdf extension, the loop skips it. For each PDF file, extract the introduction and discussion sections using the extract_sections() function. If the sections are found, ask the questions using the ask_questions() function and parse the responses. Write the results to the CSV file with the filename, question, and answer as the values for each row. If the sections are not found, write the filename and the string 'DidntFind' as the values for a single row in the CSV file.

# Open the CSV file for writing
with open('results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)

    # Write the header row
    writer.writerow(['File', 'Question', 'Answer'])

    # Loop through all the PDF files in the folder
    for filename in os.listdir(pdf_folder):
        if filename.endswith('.pdf'):
            pdf_file = os.path.join(pdf_folder, filename)
            print(filename)

            # Extract the introduction and discussion sections from the PDF file
            sections = extract_sections(pdf_file)
            if sections != 'DidntFind':
                intro_text, disc_text = sections
                print('Extracting the introduction and the discussion')

                # Ask OpenAI the questions and parse the responses
                intro_results = ask_questions(intro_text)
                disc_results = ask_questions(disc_text)
                print('Asking ChatGPT')

                # Write the results to the CSV file
                for i in range(len(questions)):
                    print('Saving')
                    writer.writerow([filename, questions[i], intro_results[i]])
                    writer.writerow([filename, questions[i], disc_results[i]])
            else:
                # Record files where the sections could not be found
                writer.writerow([filename, 'DidntFind', 'DidntFind'])
                print('Saving')
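Once the script finishes, results.csv can be loaded back to separate papers that produced answers from those where the sections were not found. A minimal sketch, using invented rows in place of a real results file:

```python
import csv
import io

# Simulated contents of results.csv (illustrative rows, not real output)
raw = (
    "File,Question,Answer\n"
    "paper1.pdf,What are the impacts of climate change?,Rising sea levels.\n"
    "paper2.pdf,DidntFind,DidntFind\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))

# Split successfully processed papers from skipped ones
answered = [r for r in rows if r['Question'] != 'DidntFind']
skipped = [r['File'] for r in rows if r['Question'] == 'DidntFind']

print(len(answered))  # -> 1
print(skipped)        # -> ['paper2.pdf']
```

In practice you would replace io.StringIO(raw) with `open('results.csv', newline='')`; the skipped list is useful for identifying PDFs whose section headings the regexes did not match.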

The complete code can be accessed on my GitHub. Please comment if you have any specific questions, and I will try to help you.
