Monday, February 18, 2013

SamShares - Parsing financial data out of annual report PDFs

What's up

I've been doing a lot of financial research, and a big chunk of that is looking through financial reports, manually copying the fields for assets, liabilities, equity, EBIT etc. It's boring as hell, and takes a long time. Why can't we automate this?

Parsing PDFs

I started by forking PyPDF2 to give me better access to the underlying objects. It's a fairly good start for working with PDFs, but it just blurts out (some of) the text in a random order, which isn't what I want. This led me down a bit of a rabbit hole, and I ended up downloading a copy of the PDF 1.7 reference and browsing through it, sections 5.2 and 5.3 in particular.

What's the plan?

  • Find the pages with assets/liabilities and income
  • Render them such that it's obvious where the columns and rows line up
  • Convert this to a spreadsheet
  • ???
  • PROFIT
For example, above is a screenshot from the annual report of New Zealand's largest NZX company, Fletcher Building. The PDF displays as lovely rows and columns, but the data can't easily be accessed that way. If we can parse the PDF and render all the text in place, we can then make fairly accurate guesses at which rows and columns the values fall into.

Quick primer to text in PDF

Here are some of the operators you'll find for manipulating text in a PDF

BT, ET - Start and end a text object. BT initialises the text matrix to the identity matrix - i.e. positioned at the origin, which in PDF's default coordinate system is the bottom left of the page
Td, TD, T* - Operators to move the cursor to the next line
Tm - Sets the text matrix. This is an affine transform with 6 parameters - the first 4 matter for manipulating the text itself (scaling, warping, italics), and the last two essentially just set the start point for the text. This is enough for us to cheat and guess which way the text will go
Tc, Tw, Tf and lots more - Spacing and font settings

Tj, TJ - Display a text string - Tj does this simply, TJ has options after each character/substring for spacing information
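
To make the primer a bit more concrete, here's a minimal Python sketch of how those operators move the text position around. It assumes the content stream has already been tokenised into (operands, operator) pairs; that input shape, the function names and the sample values are all made up for illustration and are not PyPDF2's API. It also ignores the page-level transform (cm) and the way Tj/TJ advance the cursor along a line, so it only gives the point where each string starts.

```python
IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # affine matrix as (a, b, c, d, e, f)


def translate(m, tx, ty):
    """Shift matrix m by (tx, ty), measured in m's own (possibly scaled) units."""
    a, b, c, d, e, f = m
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


def text_positions(ops):
    """Return (x, y, text) for every string shown by Tj/TJ.

    ops is a hypothetical pre-tokenised stream: a list of (operands, operator).
    """
    line = IDENTITY   # start-of-line matrix, reset by BT
    leading = 0.0     # line spacing used by T*
    found = []
    for operands, op in ops:
        if op == "BT":
            line = IDENTITY
        elif op == "Tm":
            line = tuple(float(v) for v in operands)
        elif op == "TL":
            leading = float(operands[0])
        elif op in ("Td", "TD"):
            tx, ty = operands
            if op == "TD":          # TD also sets the leading to -ty
                leading = -float(ty)
            line = translate(line, tx, ty)
        elif op == "T*":            # next line, using the current leading
            line = translate(line, 0.0, -leading)
        elif op == "Tj":
            found.append((line[4], line[5], operands[0]))
        elif op == "TJ":            # array of strings and spacing numbers
            text = "".join(p for p in operands[0] if isinstance(p, str))
            found.append((line[4], line[5], text))
    return found


# Invented sample stream, not taken from a real report.
demo_ops = [
    ([], "BT"),
    ([1, 0, 0, 1, 50, 700], "Tm"),
    (["Total assets"], "Tj"),
    ([300, 0], "Td"),
    (["100"], "Tj"),
    ([-300, -14], "TD"),
    (["Total liabilities"], "Tj"),
    ([], "T*"),
    ([["Eq", -20, "uity"]], "TJ"),
    ([], "ET"),
]
demo_positions = text_positions(demo_ops)
```

The last two numbers of the matrix are all that gets reported here, which matches the "cheat" of only caring where each string starts.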

Putting it all together

To parse a table out of a PDF, here's the rough idea:
  1. Locate all the strings on a page (BT/ET and TJ/Tj operators)
  2. Create a structure which ties the strings to locations (probably just Tm)
  3. Assign values row and column IDs
Once this is done, just check what is at the leftmost and topmost of each table, and use these as keys to the data. For the above image, the field "total assets" lined up with "June 2012" gives two results, so these just need to be referenced to the headers at the top, OR we can cheat and use the leftmost as this is generally the convention.
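
Here's a rough Python sketch of step 3 and the lookup that follows it. It assumes the strings have already been located as (x, y, text) tuples, and it groups coordinates that sit within a tolerance of each other into the same row or column. The function names, the tolerances and the sample figures are all invented for illustration. One known weakness: numbers in real reports are usually right-aligned, so their start x won't line up as neatly as it does here.

```python
def cluster(values, tol):
    """Map each coordinate to a group index; neighbours within tol share a group."""
    ids = {}
    group = -1
    prev = None
    for v in sorted(set(values)):
        if prev is None or v - prev > tol:
            group += 1
        ids[v] = group
        prev = v
    return ids


def to_grid(items, row_tol=3.0, col_tol=10.0):
    """Turn (x, y, text) items into a dict keyed by (row, column)."""
    col_ids = cluster([x for x, _, _ in items], col_tol)
    # PDF y grows upwards, so negate it to make the top row number 0.
    row_ids = cluster([-y for _, y, _ in items], row_tol)
    return {(row_ids[-y], col_ids[x]): text for x, y, text in items}


def lookup(grid, row_label, col_label):
    """Find a value using the leftmost column and topmost row as keys."""
    row = next((r for (r, c), t in grid.items() if c == 0 and t == row_label), None)
    col = next((c for (r, c), t in grid.items() if r == 0 and t == col_label), None)
    if row is None or col is None:
        return None
    return grid.get((row, col))


# Made-up positions and figures, not from any real annual report.
demo_items = [
    (300, 720, "June 2012"), (400, 720, "June 2011"),
    (50, 700, "Total assets"), (300, 700, "100"), (400, 700.5, "90"),
    (50, 686, "Total liabilities"), (300, 686, "60"), (400, 686, "55"),
]
demo_grid = to_grid(demo_items)
demo_value = lookup(demo_grid, "Total assets", "June 2012")
```

With the sample data, looking up "Total assets" against "June 2012" lands on the cell one row down and one column across from the top left.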

Next steps

Assuming I can make all this work, the data will then just be stored in a DB of some sort, keyed by year and company. Once this is automated enough to just pull PDFs out of NZX announcements, it'll be left in the background accumulating data, eventually building a corpus of financial data from NZX companies that can be used to make financial analysis much, much quicker and more versatile than it currently is.
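
As a placeholder for "a DB of some sort", here's a small sketch using Python's built-in sqlite3 module. The table layout, column names and helper functions are assumptions of mine for illustration, nothing is settled yet; the only part that comes from the plan above is keying the data by company and year.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; one row per company, year and field."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS figures (
               company TEXT NOT NULL,
               year    INTEGER NOT NULL,
               field   TEXT NOT NULL,
               value   REAL NOT NULL,
               PRIMARY KEY (company, year, field)
           )"""
    )
    return conn


def save_figure(conn, company, year, field, value):
    """Insert a figure, overwriting any earlier value for the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO figures (company, year, field, value) "
        "VALUES (?, ?, ?, ?)",
        (company, year, field, value),
    )
    conn.commit()


def load_figure(conn, company, year, field):
    """Return the stored value, or None if nothing has been parsed yet."""
    row = conn.execute(
        "SELECT value FROM figures WHERE company = ? AND year = ? AND field = ?",
        (company, year, field),
    ).fetchone()
    return row[0] if row else None
```

Using the composite primary key means re-parsing the same report just overwrites the old numbers rather than piling up duplicates.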


Tuesday, February 12, 2013

OpenFlow 1.0 support on Juniper MX240 with JunOS 12.3

Juniper have added OpenFlow to JunOS 12.3

Do you have a spare MX240 lying around? Chuck a copy of JunOS 12.3 on it and you can get OpenFlow 1.0 up and running and have a play.

Details

  • Fairly full OF1.0 implementation. I don't have a spare MX240 to test, but it would appear that everything is handled in hardware (not sure how Junipers could do otherwise tbh)
  • Supports multiple VLANs - if these can be turned on and off from the controller then this would be awesome (let me know if you find this out)
  • Doesn't handle buffered packets - make sure your controller can handle OFPT_PACKET_IN messages that don't send a buffer ID (I don't think the current version of POX does this, but the betta branch does)
  • Doesn't handle TLS connectivity to the controller - not the end of the world, but I'm curious as to why this was done
  • Doesn't do anything related to STP... who cares?
  • Only supports MX240s...
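
On the buffered packets point, here's a rough Python sketch of what a controller has to do when a packet-in arrives without a buffer ID: the whole frame comes up in the message, and it has to be sent back down inside the packet-out. The message layouts and constants are written from my reading of the OpenFlow 1.0 spec, the function names are made up, and none of this has been run against an MX240 or POX.

```python
import struct

OFP_VERSION = 0x01
OFPT_PACKET_IN = 10
OFPT_PACKET_OUT = 13
OFPAT_OUTPUT = 0
OFPP_FLOOD = 0xFFFB
NO_BUFFER = 0xFFFFFFFF  # buffer_id used when the switch did not buffer the packet


def parse_packet_in(msg):
    """Unpack an OpenFlow 1.0 OFPT_PACKET_IN message from raw bytes."""
    version, msg_type, length, xid = struct.unpack_from("!BBHI", msg, 0)
    if version != OFP_VERSION or msg_type != OFPT_PACKET_IN:
        raise ValueError("not an OpenFlow 1.0 packet-in")
    buffer_id, total_len, in_port, reason = struct.unpack_from("!IHHB", msg, 8)
    return {
        "xid": xid,
        "buffer_id": buffer_id,
        "in_port": in_port,
        "reason": reason,
        "data": msg[18:length],  # frame bytes start after one pad byte
    }


def build_packet_out(packet_in, out_port=OFPP_FLOOD):
    """Build an OFPT_PACKET_OUT that forwards the packet to out_port."""
    action = struct.pack("!HHHH", OFPAT_OUTPUT, 8, out_port, 0)
    # Unbuffered: the switch kept nothing, so the frame must travel with us.
    unbuffered = packet_in["buffer_id"] == NO_BUFFER
    payload = packet_in["data"] if unbuffered else b""
    body = struct.pack(
        "!IHH", packet_in["buffer_id"], packet_in["in_port"], len(action)
    ) + action + payload
    header = struct.pack(
        "!BBHI", OFP_VERSION, OFPT_PACKET_OUT, 8 + len(body), packet_in["xid"]
    )
    return header + body
```

A controller that only ever echoes the buffer ID back would send an empty packet-out here, and the packet would just be dropped.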
This looks like a great start, well done Juniper! Here's my list of requests for the next iteration:
  • Support more than one device :) MX80s would be great, also looking to see what the EX series implementation looks like
  • Buffered packets! Everyone else does this, and it greatly speeds up the flows-per-second bottleneck between the switch and controller
That's pretty much all from me. OF1.1 support (or 1.3 as this is where everyone is going) would be awesome so we can drive MPLS, but other than that, this is fantastic news.

Update

It looks like it's not quite ready for RouteFlow - Joe Stringer pointed this out in the notes:

• If the controller pushes a flow with a set source MAC address action, the router cannot program the corresponding filter term. However, CLI show commands still display the flow with the associated action, and the device sends an OFPET_FLOW_MOD_FAILED error message with an OFPMFC_UNSUPPORTED code to the controller. [PR 838699]
• If the controller pushes a flow with a set destination MAC address action, the router cannot program the corresponding filter term. However, CLI show commands still display the flow with the associated action, and the device sends an OFPET_FLOW_MOD_FAILED error message with an OFPMFC_UNSUPPORTED code to the controller. [PR 838709]
• If a flow contains a set IP source address action or a set IP destination address action, the device rejects the flow and sends an OFPET_FLOW_MOD_FAILED error m

In other words, no MAC/IP address rewrites = no routing :(

Disclaimer

I've been told that this info and the linked documents are public... If Juniper isn't happy with this, please get in touch and I'll fix it.