Saturday, June 22, 2024
HomeBig DataEasy methods to Run SQL on PDF Information

Easy methods to Run SQL on PDF Information


PDFs are the de facto commonplace for distributing and sharing fixed-layout paperwork at the moment. A fast survey of my laptop computer folders reveals account statements, receipts, technical papers, e-book chapters, and presentation slides—all PDFs. Numerous precious info finds its method into all method of PDF recordsdata. Which is a superb purpose for Rockset to help SQL queries on PDF recordsdata, in our mission to make information extra usable to everybody.

Quick SQL on PDFs in Rockset

Rockset makes it straightforward for builders and information practitioners to ingest and run quick SQL on semi-structured information in quite a lot of information codecs, equivalent to JSON, CSV, and XLSX, with none upfront information prep. Now add PDFs to the combo, and customers can mix PDF information with information of different codecs, from numerous sources, into their SQL analyses. Or analyzing a number of PDFs collectively is likely to be precious too, when you’ve got a sequence of electrical energy payments like I do, as we’ll see in our brief instance beneath.


bill-pdf


Importing PDFs

From an current assortment, click on the Add File button on the high proper of the console and specify PDF format to ingest into Rockset.


pdf-upload


Querying Information in PDFs

I uploaded 9 months of electrical energy payments. We will use the DESCRIBE command to view the fields that have been extracted from the PDFs.

> describe "elec-bills";
+--------------------------------------------+---------------+---------+-----------+
| discipline                                      | occurrences   | whole   | sort      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Rockset parses out all of the metadata like writer, creation_date, and so forth. from the doc together with the textual content.

The textual content discipline is often the place many of the info in a PDF resides, so let’s study what’s in a pattern textual content discipline.

+--------------------------------------------------------------+
| textual content                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Assertion Date: 10/11/2018                                   |
| Your Account Abstract                                         |
| ....                                                         |
| Whole Quantity Due:                                            |
| $157.57                                                      |
| Quantity Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+

Combining Information from A number of PDFs

With my 9 months of eletricity payments ingested and listed in Rockset, I can do some easy evaluation of my utilization over this timespan. We will run a SQL question to pick the month/12 months and billing quantity out of textual content.

> with particulars as (
    choose tokenize(REGEXP_EXTRACT(textual content, 'Assertion Date: .*'))[3] as month,
    tokenize(REGEXP_EXTRACT(textual content, 'Assertion Date: .*'))[5] as 12 months,
    solid(tokenize(REGEXP_EXTRACT(textual content, 'Whole Quantity Due:n.*nAmount Enclosed'))[4] as float) as quantity
    from "elec-bills"
) 
choose concat(month, '/', 12 months) as billing_period, quantity
from particulars
order by 12 months asc, month;

+----------+------------------+
| quantity   | billing_period   |
|----------+------------------|
| 47.55    | 04/2018          |
| 76.5     | 05/2018          |
| 52.28    | 06/2018          |
| 50.58    | 07/2018          |
| 47.62    | 08/2018          |
| 39.7     | 09/2018          |
| <null>   | 10/2018          |
| 72.93    | 11/2018          |
| 157.57   | 12/2018          |
+----------+------------------+

And plot the ends in Superset.


pdf-graph

My October invoice was surprisingly zero. Was the billing quantity not extracted appropriately? I went again and checked, and it seems I obtained a California Local weather Credit score in October which zeroed out my invoice, so ingesting and querying PDFs is working because it ought to!



RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments