Monday, February 18, 2013

SamShares - Parsing financial data out of annual report PDFs

What's up

I've been doing a lot of financial research, and a big chunk of that is looking through financial reports, manually copying out the fields for assets, liabilities, equity, EBIT and so on. It's boring as hell and takes a long time. Why can't we automate this?

Parsing PDFs

I started by forking PyPDF2 to give me better access to the underlying objects. It's a fairly good start for working with PDFs, but it just blurts out (some of) the text in a random order, which isn't what I want. This led me down a bit of a rabbit hole, and I ended up downloading a copy of the PDF 1.7 reference and browsing through it, sections 5.2 and 5.3 in particular.
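
To see what that actually gives access to, here's a minimal sketch that decodes one page's content stream into (operands, operator) pairs using PyPDF2's internal ContentStream class. The exact import path is an internal detail (so treat it as an assumption about this particular version), and the filename and page number are just placeholders.

    from PyPDF2 import PdfFileReader
    from PyPDF2.pdf import ContentStream

    # Filename and page number are only for illustration
    reader = PdfFileReader(open("annual_report.pdf", "rb"))
    page = reader.getPage(0)

    # ContentStream decodes the page's /Contents into (operands, operator)
    # pairs, which is the raw material everything below works with
    content = ContentStream(page.getContents(), page.pdf)
    for operands, operator in content.operations:
        print(operator, operands)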

What's the plan?

  • Find the pages with assets/liabilities and income
  • Render them such that it's obvious where the columns and rows line up
  • Convert this to a spreadsheet
  • ???
  • PROFIT
For example, above is a screenshot from the annual report of New Zealand's largest NZX company, Fletcher Building. The PDF displays as lovely rows and columns, but the data can't be easily accessed that way. If we can parse the PDF and render all the text in place, we can then make fairly accurate guesses at which rows and columns the values fall into.
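
For step one of the plan, a blunt keyword search over each page's text dump will probably do, even though extractText() mangles the ordering (and may need tweaking if it drops the spaces between words). The keyword list and filename here are my own guesses, nothing specific to Fletcher Building's layout:

    from PyPDF2 import PdfFileReader

    # Phrases that tend to appear on balance sheets and income statements
    KEYWORDS = ["total assets", "total liabilities", "total equity", "operating earnings"]

    def find_statement_pages(path):
        """Return the page numbers whose text mentions any of the keywords."""
        reader = PdfFileReader(open(path, "rb"))
        pages = []
        for i in range(reader.getNumPages()):
            text = reader.getPage(i).extractText().lower()
            if any(keyword in text for keyword in KEYWORDS):
                pages.append(i)
        return pages

    print(find_statement_pages("annual_report.pdf"))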

Quick primer on text in PDF

Here are some of the operators you'll find for manipulating text in a PDF

BT, ET - Start and end a text object. This initialises the text matrix to the identity matrix, i.e. positioned at the page's origin (bottom left, in PDF's default coordinates)
Td, TD, T* - Operators to move the cursor to the next line
Tm - Sets the text matrix. This is an affine transform with 6 parameters: the first 4 matter for manipulating the text itself (scaling, warping, italics), and the last two essentially just set the start point for the text. This is enough for us to cheat and guess where the text will go
Tc, Tw, Tf and lots more - Spacing and font settings

Tj, TJ - Show a text string. Tj just takes the string; TJ takes an array where the substrings are interleaved with numbers that adjust the spacing
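
To make those operators concrete, here's a rough sketch of the cheat-and-guess interpretation: walk the content stream, remember the translation part of the last Tm (its 5th and 6th operands), and attach it to every Tj/TJ string that follows. It deliberately ignores Td/TD/T* and all the font metrics, which is exactly the corner I'm happy to cut.

    from PyPDF2.pdf import ContentStream
    from PyPDF2.generic import TextStringObject

    def positioned_strings(page):
        """Yield (x, y, text) for each string drawn on the page.

        Deliberately naive: only Tm and Tj/TJ are handled, so text that is
        positioned purely by Td/TD/T* will just reuse the last Tm seen.
        """
        content = ContentStream(page.getContents(), page.pdf)
        x = y = 0.0
        for operands, operator in content.operations:
            if operator == b"Tm":
                # Tm: a b c d e f -- e and f set the start point of the text
                x, y = float(operands[4]), float(operands[5])
            elif operator == b"Tj":
                yield x, y, operands[0]
            elif operator == b"TJ":
                # TJ's operand is an array mixing strings and spacing numbers;
                # keep the strings and throw away the spacing adjustments
                text = "".join(op for op in operands[0] if isinstance(op, TextStringObject))
                yield x, y, text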

Putting it all together

To parse a table out of a PDF, here's the rough idea:
  1. Locate all the strings on a page (BT/ET and TJ/Tj operators)
  2. Create a structure which ties the strings to locations (probably just Tm)
  3. Assign each value a row and column ID
Once this is done, check what sits leftmost in each row and topmost in each column, and use these as the keys to the data. In the image above, the row "total assets" lined up with the column "June 2012" gives two results, so these need to be matched against the headers at the top, OR we can cheat and take the leftmost one, as that is generally the convention.
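
Here's a sketch of what steps 2 and 3 might look like on top of the positioned strings from earlier: bucket placements into rows by their y coordinate (text on one line shares a baseline, more or less), sort each row left to right, and treat the leftmost cell as the row label. The row tolerance and the take-the-leftmost-value cheat are both assumptions that would need checking against real reports.

    from collections import defaultdict

    ROW_TOLERANCE = 2.0  # points; a guess, tune per report

    def build_rows(placements):
        """Group (x, y, text) placements into rows, each sorted left to right."""
        buckets = defaultdict(list)
        for x, y, text in placements:
            buckets[round(y / ROW_TOLERANCE)].append((x, text))
        # PDF y coordinates increase upwards, so sort descending to read top-down
        return [sorted(buckets[key]) for key in sorted(buckets, reverse=True)]

    def lookup(rows, row_label, column=1):
        """Cheat mode: find the row whose leftmost cell matches row_label and
        return the value in the given column (1 = leftmost figure)."""
        for row in rows:
            cells = [text.strip() for _, text in row]
            if len(cells) > column and cells[0].lower().startswith(row_label.lower()):
                return cells[column]
        return None

    # e.g. lookup(build_rows(positioned_strings(page)), "Total assets")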

Next steps

Assuming I can make all this work, the data will then just be stored in a DB of some sort, keyed by year and company. Once this is automated enough to pull PDFs straight out of NZX announcements, it can be left running in the background accumulating data, eventually building a corpus of financial data from NZX companies that makes financial analysis much, much quicker and more versatile than it currently is.
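
I haven't settled on the storage side yet, but as an illustration of "keyed by year and company", one SQLite row per (company, year, field) would be enough to start with. Everything here (table name, columns, values) is made up for illustration:

    import sqlite3

    conn = sqlite3.connect("nzx_financials.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS financials (
            company TEXT    NOT NULL,
            year    INTEGER NOT NULL,
            field   TEXT    NOT NULL,   -- e.g. 'total assets', 'EBIT'
            value   REAL,
            PRIMARY KEY (company, year, field)
        )
    """)

    extracted_value = 1234.5  # placeholder; in practice this comes from the parser
    conn.execute(
        "INSERT OR REPLACE INTO financials VALUES (?, ?, ?, ?)",
        ("FBU", 2012, "total assets", extracted_value),
    )
    conn.commit()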


3 comments:

  1. How is your project going? Any updates?

  2. Hi there, this past week I stumbled upon a similar idea when trying to analyze a certain industry, and happened upon your blog. I just wanted to ask: after two years, do you still think it's a profitable idea to build this kind of database?

  3. Hey, did you make any progress on this? I'd be very interested to know more details.
