
How to Extract Text from PDF using PHP

PDF (Portable Document Format) files are a convenient way to save text and image content for offline use. A PDF can also be displayed online by embedding it in a web page with a viewer. However, the text and graphics inside an embedded PDF are not part of the page's own HTML, so the content is not rendered on the page itself, which hurts SEO. To overcome this problem, extract the text from the PDF and add it to the web page.

The PDF Parser library can be used to extract elements from PDF files with PHP. It parses a PDF file and pulls the text content from all of its pages, and it can also extract headers and metadata. This tutorial demonstrates how to extract text from PDF files with PHP.

The example script shows how to use the PDF Parser library to extract text from a PDF file using PHP. We will also show how to upload a PDF file and extract its text on the fly.

Install PDF Parser Library

Run the following command to install the PDF Parser library with Composer.

composer require smalot/pdfparser

Note: You don't need to download the PDF Parser library separately, as all required files are included in the source code. Download the source code if you want to install and use PDF Parser without Composer.

Include the autoloader to load the PDF Parser library and its helper functions in the PHP script.

include 'vendor/autoload.php';

Extract Text from PDF

The following code snippet extracts all the text content from a PDF file using PHP.

  • Load and initialize the PDF Parser library.
  • Specify the source PDF file from which the text content will be retrieved.
  • Parse the PDF file using the parseFile() method of the Parser class.
  • Extract the text from the parsed document using its getText() method.

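<?php
// Load the PDF Parser library through the Composer autoloader
include 'vendor/autoload.php';

// Initialize the parser
$parser = new \Smalot\PdfParser\Parser();

// Source PDF file
$PDFfile = 'test.pdf';

// Parse the file and extract the text from all pages
$PDF = $parser->parseFile($PDFfile);
$PDFContent = $PDF->getText();

// Output the text, converting newlines to HTML line breaks
echo nl2br($PDFContent);
?>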
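
Since the extracted text is meant to be added to a web page, it is usually worth escaping it before output, so that characters such as < or & in the PDF are shown literally instead of being read as markup. The following is a minimal sketch, assuming a hypothetical helper named pdfTextToHtml() that is not part of the PDF Parser library. A plain string stands in for the value returned by getText().

```php
<?php
// Hypothetical helper (not part of PDF Parser): prepares extracted text for HTML output.
function pdfTextToHtml(string $text): string
{
    // Escape HTML special characters first, so markup-like text is displayed literally.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Then insert line break tags before newlines to keep the line structure visible.
    return nl2br($escaped);
}

// Example usage with a plain string standing in for $PDF->getText().
echo pdfTextToHtml("Price < 10\nSecond line");
```

In the script above, the last line could then become echo pdfTextToHtml($PDFContent);.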

See the PDF Parser library documentation to explore more features.
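
As an illustration of those other features, the sketch below reads the document metadata with getDetails() and the text of each page with getPages(). The helper formatPdfDetails() and the sample file name test.pdf are assumptions made for this example. The library part only runs when the Composer autoloader and the sample file are present.

```php
<?php
// Hypothetical helper: turns a metadata array into "Property: value" lines.
function formatPdfDetails(array $details): string
{
    $lines = [];
    foreach ($details as $property => $value) {
        // Some properties can hold several values; join them into one string.
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $lines[] = $property . ': ' . $value;
    }
    return implode("\n", $lines);
}

$autoload = __DIR__ . '/vendor/autoload.php';
$pdfFile  = __DIR__ . '/test.pdf'; // assumed sample file

// Only run the library part when the autoloader and the sample file exist.
if (is_file($autoload) && is_file($pdfFile)) {
    require_once $autoload;

    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile($pdfFile);

    // Document metadata (for example title, author, page count), if the PDF provides it.
    echo formatPdfDetails($pdf->getDetails()), "\n";

    // Text of each page separately instead of the whole document at once.
    foreach ($pdf->getPages() as $index => $page) {
        echo '--- Page ', $index + 1, " ---\n", $page->getText(), "\n";
    }
}
```

Which metadata properties are available depends on the PDF file itself, so the output will differ from one document to another.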

Upload PDF File and Extract Text

This code snippet will show how to upload PDFs and extract the text with PHP.

PDF Form for Uploading Files:

Define the HTML form elements for the file upload.

<form action="parse.php" method="POST" enctype="multipart/form-data">
   <div class="pdf-input">
      <label for="pdf">PDF File</label>
      <input type="file" id="pdf" name="pdf" placeholder="Select a PDF file" required="">
   </div>
   <input type="submit" name="submit" class="btn btn-large" value="Submit">
</form>

When the form is submitted, the selected file is uploaded to the server-side script for further processing.
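
An upload can fail before the script ever sees a file, for example when the file exceeds the server's size limit. PHP reports this in the error element of the $_FILES entry. The sketch below, built around a hypothetical helper named uploadErrorMessage(), shows one way to turn that code into a message. The key pdf is assumed to match the name attribute of the file input in the form.

```php
<?php
// Hypothetical helper: maps an upload error code from $_FILES to a readable message.
function uploadErrorMessage(int $code): string
{
    switch ($code) {
        case UPLOAD_ERR_OK:
            return ''; // no error
        case UPLOAD_ERR_INI_SIZE:
        case UPLOAD_ERR_FORM_SIZE:
            return 'The file is too large.';
        case UPLOAD_ERR_PARTIAL:
            return 'The file was only partially uploaded.';
        case UPLOAD_ERR_NO_FILE:
            return 'Please select a file.';
        default:
            return 'The upload failed (error code ' . $code . ').';
    }
}

// Fall back to "no file" when the "pdf" key is missing from $_FILES.
$errorCode = $_FILES['pdf']['error'] ?? UPLOAD_ERR_NO_FILE;
echo uploadErrorMessage($errorCode);
```

Using the ?? operator also avoids an "Undefined array key" warning when the key in $_FILES does not match the input's name attribute.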

Server-side script (parse.php) to extract text from PDF File:

The code below handles the uploaded file and extracts the text from the PDF.

  • Retrieve the name of the file using $_FILES in PHP.
  • Get the file extension using the pathinfo() function with the PATHINFO_EXTENSION flag.
  • Validate the file by checking that it has the pdf extension.
  • Get the path of the uploaded file using tmp_name in $_FILES.
  • Parse the uploaded PDF file and extract the text content with the PDF Parser library.
  • Format the text content by replacing newlines (\n) with line breaks (<br>) using the nl2br() function in PHP.

<?php
$PDFContent = '';

if(isset($_POST['submit'])){
   // Check that a file was selected
   if(!empty($_FILES["pdf"]["name"])){
      // Get the file name and its extension (lowercased so "PDF" also matches)
      $PDFfileName = basename($_FILES["pdf"]["name"]);
      $PDFfileType = strtolower(pathinfo($PDFfileName, PATHINFO_EXTENSION));

      // Allow only the pdf extension
      $allowTypes = array('pdf');
      if(in_array($PDFfileType, $allowTypes)){
         // Load the PDF Parser library
         include 'vendor/autoload.php';
         $parser = new \Smalot\PdfParser\Parser();

         // Source file (temporary path of the upload)
         $PDFfile = $_FILES["pdf"]["tmp_name"];
         $PDF = $parser->parseFile($PDFfile);
         $fileText = $PDF->getText();

         // Escape the text for HTML output and convert newlines to line breaks
         $PDFContent = nl2br(htmlspecialchars($fileText));
      }else{
         $PDFContent = '<p>Only PDF files are allowed to upload.</p>';
      }
   }else{
      $PDFContent = '<p>Please select a file.</p>';
   }
}

// Display content
echo $PDFContent;
?>

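
The extension check only looks at the file name, which the user controls. A slightly stricter check, sketched below, also reads the first bytes of the uploaded file, since PDF files normally begin with the signature %PDF-. The helper looksLikePdf() is a hypothetical name for this example, and the two temporary files stand in for real uploads.

```php
<?php
// Hypothetical helper: checks whether a file starts with the "%PDF-" signature.
function looksLikePdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Read only the first five bytes and compare them with the signature.
    $header = fread($handle, 5);
    fclose($handle);

    return $header === '%PDF-';
}

// Demonstration with two temporary files standing in for uploads.
$real = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($real, "%PDF-1.4\nsample body");
$fake = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($fake, 'just some plain text');

var_dump(looksLikePdf($real)); // bool(true)
var_dump(looksLikePdf($fake)); // bool(false)

unlink($real);
unlink($fake);
```

In the upload script, such a check could be applied to $_FILES["pdf"]["tmp_name"] before the file is passed to parseFile(). It is a quick sanity check only and does not prove that the file is a well-formed PDF.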
Huzoor Bux: I am a PHP Developer

Comments

  • Can you help?

    Executing in my local machine, using Xampp, returns this error:

    Warning: Undefined array key "pdf_file" in C:\xampp\htdocs\imoveis\parse.php on line 5

    Warning: Trying to access array offset on value of type null in C:\xampp\htdocs\imoveis\parse.php on line 5

    only PDF file is allowed to upload.