Monday, July 15, 2013

Building a better internet exchange with OpenFlow

Internet exchanges


The internet is a collection of cables and routers - cables join the routers together, and routers decide where your data should go. Internet exchanges perform a crucial job in the middle of this - they provide a way for different (often competing) providers to exchange internet traffic. An internet exchange is typically implemented using some number of switches that are federated together somehow so that every port can exchange traffic with every other port as fast as they need to.

Problems with internet exchanges

Many internet exchanges are implemented as a layer-2 broadcast domain, and often include a route server to make it easier for participants to exchange routes with everyone over a single peering. Layer-2 broadcast domains need to be well policed though, because people can easily mess with other people's traffic, either accidentally or maliciously.

Things that you need to watch out for on a layer-2 broadcast domain:
  • Broadcast traffic - ARP requests are necessary, others probably aren't
  • MAC spoofing
  • ARP spoofing/proxy ARP
  • STP & other lovely protocols
Things that you need at an IPv4/IPv6 peering exchange:
  • Unicast IPv4 and IPv6 traffic
  • ARP requests (not necessarily broadcast)
  • Legitimate ARP replies

Where's the button to turn this all on?

You'll often build an internet exchange with cheap and big switches, so you're not likely to have a huge amount of functionality to do clever stuff like this - and I've never seen a switch config before that lets me check the validity of ARP replies. To do this in software, however, is pretty easy.

Software Peering Exchange

Here's something I put together last night: https://github.com/samrussell/ryu/blob/master/ryu/app/spe.py

It's an extension of my OF1.2 learning switch, with an extra table to handle ARP entries. Here's how the tables are laid out:


The controller takes some simple JSON config of which IP is expected on each port, and loads these into the switch. Table 0 forwards all ARP packets to table 1, and table 1 checks the packets - if they're ARP requests, then it forwards them out the appropriate port (so ARP requests are effectively unicast across the internet exchange), and for ARP replies, it only allows each port to reply for its own IP address. Any packets that don't get forwarded out a port get sent to the controller (including valid ARP replies before we learn any MAC addresses, but the controller deals with this smartly & learns the MAC address properly).
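To make the table 1 logic concrete, here's a rough sketch of the decision it makes for each ARP packet. This is plain Python rather than the actual flow entries in spe.py, and the names (PORT_IPS, handle_arp) and addresses are made up for illustration. It also simplifies things by picking the output port for a reply from the target IP rather than from a learned MAC address.

```python
ARP_REQUEST = 1
ARP_REPLY = 2

# Hypothetical config: the IP address expected on each switch port.
PORT_IPS = {1: "10.0.0.1", 2: "10.0.0.2", 3: "10.0.0.3"}
IP_PORTS = {ip: port for port, ip in PORT_IPS.items()}


def handle_arp(in_port, opcode, sender_ip, target_ip):
    """Return ("output", port) or ("controller", None) for one ARP packet."""
    if opcode == ARP_REQUEST:
        # Send the request only to the port that owns the target IP,
        # so it is effectively unicast across the exchange.
        out_port = IP_PORTS.get(target_ip)
        if out_port is not None and out_port != in_port:
            return ("output", out_port)
    elif opcode == ARP_REPLY:
        # A port may only reply for the IP address configured on it.
        if PORT_IPS.get(in_port) == sender_ip:
            out_port = IP_PORTS.get(target_ip)
            if out_port is not None:
                return ("output", out_port)
    # Anything we did not forward goes to the controller to deal with.
    return ("controller", None)
```

A reply from port 3 claiming to be 10.0.0.2 never reaches another participant here - it falls through to the controller, which can log it and drop it.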

This is combined with the learning switch that I posted earlier in the week, with one alteration - our source-MAC matches also have to match the ethertype, and this can be constrained in the controller to limit it to IPv4 or IPv6 (or both - this could be done on a per-port basis if you wanted to). Packets with known source MACs are sent to table 2, and then forwarded out the right port if we've learned the destination MAC - otherwise, they go to the controller for processing.

What else can we do?

We can lock down the controller a bit to only allow IPv4, IPv6 and ARP - this would stop CDP, STP and the like leaking into the exchange - only valid traffic, and no spoofing.

We could go a step further though - we could build in ACLs on each port based on the routes that they announce for basic RPF, so that a participant can only route traffic from an IP range if they advertise that route to the exchange already - this would prevent asymmetric routing.
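Here's a minimal sketch of what that per-port check could look like, using Python's ipaddress module. The ANNOUNCED table and source_allowed function are hypothetical - in practice the prefixes would come from whatever each participant advertises to the route server, and the check would be compiled into flows rather than run per packet.

```python
import ipaddress

# Hypothetical state: prefixes each port has announced to the exchange.
ANNOUNCED = {
    1: [ipaddress.ip_network("192.0.2.0/24")],
    2: [
        ipaddress.ip_network("198.51.100.0/24"),
        ipaddress.ip_network("2001:db8::/32"),
    ],
}


def source_allowed(in_port, src_ip):
    """True if src_ip falls inside a prefix announced by in_port."""
    addr = ipaddress.ip_address(src_ip)
    # A port with no announcements is allowed to source nothing.
    return any(addr in net for net in ANNOUNCED.get(in_port, []))
```

Mixing IPv4 and IPv6 in the same list is fine here, because a membership test between different address families just returns False.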

So go, have a play - see if you can break it or find something else to add!

Sunday, July 14, 2013

OpenFlow 1.2 switch app with multiple tables

Another OpenFlow switch app?

I'm sorry - there's quite a few of these, but this one is cool. I've been part of a couple of OpenFlow bootcamps in the last year and a bit, and an example that I've used is the pyswitch app that came stock with NOX. It's only a few lines of functional code, but it's a nice example to show what OpenFlow controller code looks like, and how you can easily build a learning switch from scratch. We'd go through a couple of scenarios on the whiteboard, but then we'd find a problem that OpenFlow 1.0 couldn't solve.

Why we need multiple flow tables

Let's say we have two nodes connected to the OpenFlow switch. Node 1 sends a packet to node 2, and the switch doesn't know this MAC address, so it passes the packet up to the controller. The controller learns the MAC address of node 1 (where the packet came from), and assigns it to port 1. It then pushes a flow to the switch to make sure that traffic for node 1 goes out the right port, and then sends the packet on its way.

What happens when node 2 replies?

The switch gets the packet for node 1, has a flow for it, and forwards it. There's a problem here though - because the packet never went to the controller, the controller never learns the MAC address of node 2 and never installs a flow for it - so all traffic to node 2 will go via the controller and clog things up.

The OpenFlow learning switch apps that I've seen deal with this by storing pairs - they match on a destination MAC address *and* on the source MAC address - this way, whenever we get a packet from a new MAC address, we make sure the controller knows and pushes out the right flow. This works perfectly, but it's inefficient - we need O(n^2) flows, meaning 100 MAC addresses on a switch require roughly 10,000 flows (100 x 99 = 9,900 ordered source/destination pairs, i.e. one flow in each direction for each pair of MAC addresses).

This is where later versions of OpenFlow start to shine. We can create multiple flow tables, so we have an initial table to check the source MAC (and forward to the controller if we don't know it), and a second table to check the destination MAC (and forward to controller, or just flood, if we don't know it). This way we need to record each source MAC once, and each destination MAC once - so 100 MAC addresses on a switch only needs 200 flows - only O(n).
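If you want to convince yourself of the flow count, here's a toy simulation of the two-table idea in plain Python. It isn't Ryu code - the TwoTableSwitch class is made up for illustration, and it lumps the switch and controller together so that a "packet-in" simply installs one flow in each table.

```python
class TwoTableSwitch:
    """Toy model: table 0 matches source MACs, table 1 matches destinations."""

    def __init__(self):
        self.src_table = {}  # table 0: source MAC -> port we learned it on
        self.dst_table = {}  # table 1: destination MAC -> output port
        self.packet_ins = 0  # how many packets had to visit the controller

    def receive(self, in_port, src_mac, dst_mac):
        # Table 0: an unknown source goes to the controller, which learns
        # the MAC and installs one flow in each table for it.
        if self.src_table.get(src_mac) != in_port:
            self.packet_ins += 1
            self.src_table[src_mac] = in_port
            self.dst_table[src_mac] = in_port
        # Table 1: forward on a destination match, otherwise flood.
        return self.dst_table.get(dst_mac, "flood")

    def flow_count(self):
        return len(self.src_table) + len(self.dst_table)
```

Each MAC address costs exactly one packet-in and two flows, no matter how many other hosts it talks to.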

What it looks like


I've spent the whole weekend getting OpenVSwitch working so I could test it, but this is the final product. My good friend Josh suggested I use the Ryu controller as it has OpenFlow 1.2 and 1.3 support. It comes with a simple_switch application for OpenFlow 1.0, so I modified this to make it work with OpenFlow 1.2. There's a couple of subtle differences between the protocols, but once it was refactored for 1.2, it was pretty easy to set up a second table. The code is here if you want it, and you can just clone this repo when you want to use Ryu - it's up to date with the current github version (as of 14 July 2013), but I'm hoping they'll accept my pull request and just add my app into the main build.

Next steps

I really want to get BIRD working with an OpenFlow controller, so we'll see if that happens - I've got BIRD pushing out a stream of JSON routes that anything can pick up, and I just need this to get turned into flows. It really does feel like we're getting closer to a production OpenFlow controller that plugs into a cheap switch and takes a whole internet route table. Now is a very good time to be a network engineer!

How to build an OpenFlow testbed on an Ubuntu 12.04 VM in VirtualBox

Installing Ubuntu

I started with an Ubuntu Server 12.04.2 64 bit iso, and a VirtualBox VM with 1024MB of RAM and 8GB of hard disk. My version of VirtualBox is 4.0.10r72479 on Windows 7 x64 Professional. The install is pretty normal - if you've never installed Ubuntu Server before, you shouldn't find this too hard - just follow the prompts and keep pressing enter.

This would be a good time to pour yourself a single malt coffee

Don't be too fussy about packages, but as a rule I tend to install the OpenSSH server just because it's a good habit to get into - and it's something that'll trip you up the one time you forget and it's really important.

Pro tip ™

This will take a couple of minutes to finish, and then you'll have an Ubuntu install ready to go. Remember to eject the ISO, and then turn the machine off. Time for the ugly stuff.

Configuring the network stuff

I've set up my testbed with 3x Ubuntu servers, 2 of which were set up like this and left, and a third that we did some special stuff with. For all of them, we'll need to set up extra adaptors on Internal Networks - the two client machines each get a single new adaptor with their own intnet, and the OVS machine (the third one) gets two new adaptors - the first one goes onto intnet1 (to connect to client 1), and the second goes onto intnet2. I've left the original adaptor untouched on all of the machines so we can add packages later without having to break networking.

Edit our new OFSwitch2 VM

Keep Adapter 1 as is so we can download stuff

Intnet1 matches up with client VM 1

Intnet2 to client VM 2

Once you've set this up, find the vbox file for your VM and open it up in your text editor. Make sure you close VirtualBox first - otherwise it won't take your changes. You'll want to add the following lines:

<ExtraDataItem name="VBoxInternal/Devices/e1000/1/LUN#0/Config/IfPolicyPromisc" value="allow-all"/>
<ExtraDataItem name="VBoxInternal/Devices/e1000/2/LUN#0/Config/IfPolicyPromisc" value="allow-all"/>

The secret sauce

That last part is super important - I spent a few hours today and last night trying to figure out why some packets would hit the bridge and others wouldn't - VirtualBox by default will accept broadcasts and unicasts to your address, but not other MAC addresses. Being a switch, you generally want to accept every MAC address except your own, so this is fairly important.

Installing OpenVSwitch

I've used version 1.10 because it's the coolest. Download it to a folder on your VM, untar, and read the INSTALL file because that's what cool kids do. In actual fact, there's an INSTALL.Debian, but that didn't work for me, so I just built it the generic way.

Packages to install (so you don't spend the next hour chasing dependencies):

  • build-essential
  • pkg-config
  • autoconf
  • automake
  • python-qt-dev
  • python-dev
  • python-twisted-conch
  • libtool
Then run the install
./boot.sh
./configure
make
sudo make install

I'm pleasantly surprised to say that this all worked the first time - just make sure you install all of those packages in one go and it'll work perfectly from the start :)

Running OpenVSwitch

Now is a good time to start up OpenVSwitch to test that everything is working as you should expect - if we do this right, then the OpenFlow part will be easy. Fire up your two client machines, and set up eth1 on both of them to IPs in the same range - I've used 10.1.1.1/24 and 10.1.1.2/24, but use something else if this would clash with your other network.

Once you have them up, start up OpenVSwitch with the following stuff - I've kept them in separate screens to make it easier

Start a screen (screen)

Screen 0:
sudo modprobe openvswitch
sudo ovsdb-tool create

sudo ovsdb-server --remote=ptcp:9999:127.0.0.1

New screen(CTRL+A, C)

Screen 1
sudo ovs-vswitchd tcp:127.0.0.1:9999

Screen 2
ovs-vsctl --db=tcp:127.0.0.1:9999 add-br br0
ovs-vsctl --db=tcp:127.0.0.1:9999 add-port br0 eth1
ovs-vsctl --db=tcp:127.0.0.1:9999 add-port br0 eth2
ovs-vsctl --db=tcp:127.0.0.1:9999 set bridge br0 protocols=OpenFlow12
sudo ifconfig eth1 up
sudo ifconfig eth2 up

If you bring up your client VMs you should be able to ping between them now. If you can, then great - we'll move onto getting OpenFlow working. You need one more line of code, assuming the controller is (or will be) on the same machine:

ovs-vsctl --db=tcp:127.0.0.1:9999 set-controller br0 tcp:127.0.0.1:6633

Getting OpenFlow going

We're on the home straight here. You can install the controller of your choosing, or you can install Ryu with the following instructions:

sudo apt-get install git python-setuptools
git clone http://github.com/osrg/ryu
cd ryu
sudo python setup.py install

You can then sit and watch as it downloads its dependencies from pypi. When it's done, fire up the controller with an app, and you're ready to go.

ryu-manager ryu/app/simple_switch.py

You can check the flow tables (in another screen) with the following command:

sudo ovs-ofctl dump-flows br0

Check out the manpages if you want to learn more.

You've got an OpenFlow testbed now, you can do what you want with it. Play with different controllers, or different versions of OpenFlow - it's all up to you.

OpenFlow 1.2 with Ryu PREVIEW

I've been playing with the Ryu OpenFlow controller this weekend, and I've got something that's nearly ready for you - here's a sneak preview


https://github.com/samrussell/ryu/blob/master/ryu/app/simple_switch_12.py

Tuesday, July 2, 2013

SDN plugin for the BIRD software router

A BGP router for SDN and OpenFlow

I've been playing with the BIRD software router for a couple of weeks to make it pipe out routes that I can play with. The reason for this is so that we can leverage the years of development time spent on the BGP side of things, and then simply translate the routes into flows to make a BGP router. There are already projects using RouteFlow for this, but we can simplify things a lot further - RouteFlow relies on VMs that run Quagga instances, and I feel it would be much better if the software router just talked directly to the OpenFlow controller.

What does it look like?

Every time a route update comes through to BIRD, BIRD pipes it out to a file. We can then get SDN controllers to follow this file and pull out the routes in JSON format and process as they wish to.
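As a rough idea of what the controller side could look like, here's a sketch that reads one JSON object per line and keeps a table of current routes. The one-update-per-line layout and the field names (action, prefix, next_hop) are assumptions for the sake of the example - check the output of the BIRD branch for the real format.

```python
import json


def parse_route_updates(lines):
    """Build {prefix: next_hop} from an iterable of JSON lines.

    Assumes (hypothetically) that each line is an object such as
    {"action": "announce", "prefix": "...", "next_hop": "..."}.
    """
    routes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            update = json.loads(line)
        except ValueError:
            continue  # half-written or corrupt line, skip it
        if not isinstance(update, dict) or "prefix" not in update:
            continue
        if update.get("action") == "withdraw":
            routes.pop(update["prefix"], None)
        else:
            routes[update["prefix"]] = update.get("next_hop")
    return routes
```

Because it takes any iterable of lines, you can hand it an open file object directly, or wrap it in a loop that re-reads the file as it grows.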

What's next?

Make an OpenFlow controller that polls the file and pushes out flows to your OF switch(es). This stuff isn't rocket science - just download and install and see it work for yourself!

https://github.com/samrussell/bird/tree/sam

Sunday, June 2, 2013

Brocade and Juniper Interop - OSPF, MPLS, VLL/VPLS, and VRF interconnects

From the start

We run a mix of Brocade MLX and Juniper MX80's at work, and I've spent the last week trying to make them talk to each other properly. You'd think that by 2013, a multi-vendor network would work fine using standardised protocols, but it's still quite time-consuming finding which ways work and which ways most certainly don't. Oddly enough, I'm not the only person who's been working on this recently - Nick Buraglio has done a bit in the last couple of weeks too (thanks for the help on this).

As SDN starts to take over, this will become much less of a problem, but until then, here's how to do the following with Juniper MX-series routers and Brocade MLX/XMR chassis:
  • Jumbo frames
  • OSPF
  • MPLS
  • VPLS
  • VLL/l2circuit
  • Tagging VPLS/VLL/l2circuit into a VRF on Juniper
Disclaimer - the MX80 chassis has lots of stuff built in, including a tunnel services PIC - we need this for some of the stuff below, not sure how it works with bigger chassis.

Jumbo frames

This should be easy, but there's a couple of things that will trip you up if you aren't careful. The maximum frame size you can have on the Brocades is 9216 bytes, and on the Junipers it's 9192 bytes. I tried to set the frame size on the Brocades down to 9192 bytes and found a weird quirk - I could send 9146 byte pings from the Juniper, but the Brocade would only respond to 9142 byte pings - it appears the Brocades include the FCS in their count of frame size.

In the end, the best solution was to just leave both routers at their maximum values, set VPLS/VLL MTUs to 9100 bytes (or some number with a bit of headroom over 9000), and IP MTUs to 9000 (except for OSPF interfaces, but we'll get to that soon).

The main thing to remember is that on the Junipers, you can set MTU on the physical interface, or inside a "family XXX" stanza, but not directly on a logical interface. If you want to set the IP MTU for a logical interface, it sits in "interface XXX -> unit Y -> family inet -> mtu 9000"

OSPF

You *can* stand up OSPF with Jumbo frames, but it's fine with 1500 byte frames. We're in the position of introducing Junipers into our Brocade OSPF cloud, and since the defaults for Brocade are already 1500 bytes, it's easier to step the Junipers down than bring the Brocades up. I've set up the Brocade-facing interface like this:

interfaces {
    ge-1/0/0 {
        vlan-tagging;
        mtu 9192;
        unit 1000 {
            vlan-id 1000;
            family inet {
                mtu 1500;
                address 10.1.2.1/30;
            }
            family mpls;
        }
    }
}

The Brocade end looks like this:

interface ve 100
 bfd interval 100 min-rx 100 multiplier 3
 ip ospf area 0
 ip ospf cost 100
 ip ospf dead-interval 40
 ip ospf hello-interval 10
 ip address 10.2.3.2/30
 ip mtu 1500
!

The only tricky bit is making sure the IP MTU is the same for each end - if you get a huge route update going into an interface that can't take the whole packet then you'll end up blackholing routes. Junipers are supposed to not stand up OSPF when there's an IP MTU mismatch, but it doesn't always work - it pays to test with ping packets to confirm - you should be able to ping in either direction with 1472 bytes of data (1500 - 20 byte IP header - 8 byte ICMP header).
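The same sum works for any IP MTU, so here it is as a tiny helper. It assumes a plain IPv4 header with no options; max_ping_payload is just a name I've picked for the example.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(ip_mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return ip_mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
```

For the 1500 byte OSPF interfaces that gives 1472, and for the 9000 byte IP MTU used elsewhere it gives 8972. Remember to set the don't-fragment bit when you test, otherwise a bigger ping will quietly fragment and still succeed.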

MPLS

This is pretty straightforward - set up loopback interfaces on each end, and enable RSVP and LDP. We'll use RSVP for the outer tags, and LDP for the inner tags - no clever BGP signalling here.

Juniper:

protocols {
    rsvp {
        interface all {
            disable;
        }
        interface ge-1/0/0.1000;
        interface lo0.0;
    }
    mpls {                              
        label-switched-path 1-to-2 {
            from 10.0.0.1;
            to 10.0.0.2;
            fast-reroute;
        }
        label-switched-path 1-to-3 {
            from 10.0.0.1;
            to 10.0.0.3;
            fast-reroute;
        }
        interface all {
            disable;
        }
        interface ge-1/0/0.1000;
    }
    ospf {
        traffic-engineering;
        area 0.0.0.0 {
            interface all {
                disable;
            }
            interface ge-1/0/0.1000 {
                hello-interval 3;       
                dead-interval 12;
                bfd-liveness-detection {
                    minimum-interval 300;
                    multiplier 3;
                }
            }
            interface lo0.0 {
                passive;
            }
        }
    }
    ldp {
        interface all {
            disable;
        }
        interface lo0.0;
    }
}

Brocade:

router mpls
 policy
  traffic-eng ospf


 mpls-interface ve100


 path 3-to-1
  loose 10.0.0.1                                                  

 path 3-to-2
  loose 10.0.0.2

 path S3-to-1
  loose 10.0.0.1

 path S3-to-2
  loose 10.0.0.2


 lsp LSP-3-to-1
  to 10.0.0.1
  primary 3-to-1
  secondary S3-to-1
    standby
  frr
  revert-timer 30
  enable

 lsp LSP-3-to-2
  to 10.0.0.2
  primary 3-to-2                                                  
  secondary S3-to-2
    standby
  frr
  revert-timer 30
  enable

VPLS

This is where it gets interesting. In my opinion, Brocade does the right thing (packets come out of a VLAN-tagged "pipe", and then go into a VPLS "pipe"), whereas Juniper does it at a lower and less-abstract level (packets with headers that get altered) - it's more flexible, but it's harder to make it do the right thing.

Raw mode, tagged mode, and tags inside raw mode

VPLS has two modes - Raw mode creates a broadcast domain between all peers on the same VPLS, whereas tagged mode allows you to use inner tags within a VPLS circuit. The problem comes where Juniper sends 802.1q VLAN-tagged packets through a raw-mode VPLS. You end up left with a situation where traffic can go one way, but not the other, and it's all quite confusing.

Raw mode interop

Check out these configs:

On Juniper, we make a routing instance, and add interfaces into it. They can be VLAN-tagged or normal access ports, and there's a very important trick to make it all use raw mode properly - the line "vlan-id none". If you don't do this, packets on an untagged port go through fine, but packets from tagged ports come through with 802.1q VLAN tags on them. On Brocade, a VPLS gets delivered to a mix of tagged and untagged ports, but all traffic is sent as normal untagged ethernet. The "vlan-id none" line makes the Junipers behave in the same way. The config below delivers VPLS 40 untagged at both ends, and VPLS 140 tagged as VLAN 140.

Don't worry too much about the MTUs - they need to match up, but they don't appear to be enforced. We picked 9100 as it's well under the 9192 byte hardware MTU, but well above the 9000 byte IP MTU - a bit of leeway in each direction.

Juniper:

routing-instances {
    vpls-40 {
        description vpls-40;
        instance-type vpls;
        vlan-id none;
        interface ge-1/1/9.40;           
        protocols {
            vpls {
                no-tunnel-services;
                vpls-id 40;
                mtu 9100;
                neighbor 10.0.0.2;
                neighbor 10.0.0.3;
            }
        }
    }
    vpls-140 {
        description vpls-140;
        instance-type vpls;
        vlan-id none;
        interface ge-1/1/9.140;           
        protocols {
            vpls {
                no-tunnel-services;
                vpls-id 140;
                mtu 9100;
                neighbor 10.0.0.2;
                neighbor 10.0.0.3;
            }
        }
    }
}
interfaces {
    ge-1/1/9 {
        flexible-vlan-tagging;
        native-vlan-id 40;
        mtu 9192;
        encapsulation flexible-ethernet-services;
        unit 40 {
            encapsulation vlan-vpls;
            vlan-id 40;
            family vpls;
        }
        unit 140 {
            encapsulation vlan-vpls;
            vlan-id 140;
            family vpls;
        }
    }
}

Brocade:

router mpls
 vpls vlan40 40 
  vpls-peer 10.0.0.1 10.0.0.2
  vpls-mtu 9100
  vlan 40
   untagged ethe 1/5 


 vpls vlan140 140 
  vpls-peer 10.0.0.1 10.0.0.2
  vpls-mtu 9100
  vlan 140
   tagged ethe 1/5 

If you're interested in the mechanics behind the Juniper implementation, the "show interface" command gives you a bit of insight - Juniper interprets the "vlan-id none" line in the routing instance and converts that to tag push/pop operations on the interface:

  Logical interface ge-1/1/9.40 (Index 332) (SNMP ifIndex 564) 
    Flags: SNMP-Traps 0x0
    VLAN-Tag [ 0x8100.40 ] Native-vlan-id: 40 In(pop) Out(push 0x8100.40) 
    Encapsulation: VLAN-VPLS
    Input packets : 9 
    Output packets: 7
    Protocol vpls, MTU: 9192            
      Flags: Is-Primary

  Logical interface ge-1/1/9.140 (Index 333) (SNMP ifIndex 597) 
    Flags: SNMP-Traps 0x0 VLAN-Tag [ 0x8100.140 ] In(pop) Out(push 0x8100.140) 
    Encapsulation: VLAN-VPLS
    Input packets : 149 
    Output packets: 92
    Protocol vpls, MTU: 9192
      Flags: Is-Primary

VLL/l2circuit

This is the fun one. Tagged-mode VLLs are the easiest to get up and running, but raw-mode should be doable too. There is one problem though - the only way I can make raw-mode work on the Junipers looks like a filthy hack, but it produces the same results that we see above in the "show interface" output for VPLS.

First off, configs for tagged mode

Brocade:

router mpls
 vll vlan42 42
  vll-mtu 9100
  vll-peer 10.0.0.1
  vlan 42
   tagged e 1/5

Juniper:

interfaces {
    ge-1/1/9 {
        flexible-vlan-tagging;
        mtu 9192;
        encapsulation flexible-ethernet-services;
        unit 42 {
            encapsulation vlan-ccc;
            vlan-id 42;                     
            family ccc;
        }
    }
}

protocols {
    l2circuit {
        neighbor 10.0.0.3 {
            interface ge-1/1/9.42 {
                virtual-circuit-id 42;
                mtu 9100;
                encapsulation-type ethernet-vlan;
            }
        }
    }
}

This is all pretty straightforward, and works out of the box. Here's what the raw mode config looks like

Brocade:

router mpls
 vll vlan41 41 raw-mode
  vll-mtu 9100
  vll-peer 10.0.0.1
  vlan 41                                                         
   tagged e 1/5

Juniper:

interfaces {
    ge-1/1/9 {
        flexible-vlan-tagging;
        mtu 9192;
        encapsulation flexible-ethernet-services;
        unit 41 {
            encapsulation vlan-ccc;
            vlan-id 41;
            input-vlan-map pop;
            output-vlan-map push;
            family ccc;
        }
    }
}
protocols {
    l2circuit {
        neighbor 10.0.0.3 {
            interface ge-1/1/9.41 {
                virtual-circuit-id 41;
                mtu 9100;
                encapsulation-type ethernet;
            }
        }
    }
}

As you can see, the default for Brocade is tagged mode, so we need to explicitly put it in raw mode. On the Juniper end, we set the encapsulation type on the VLL to ethernet instead of ethernet-vlan, but this only works with encapsulation ethernet-ccc on the physical interface. If we have VLAN tagging mode enabled, there doesn't seem to be any way to tell the MX80 about this. The way I've made this work is with the "input-vlan-map" and "output-vlan-map" statements - they seem to round everything out and make it all work. Given the default for both Juniper and Brocade is tagged mode, and we need a bit of mad hax to make raw mode work here, it might make sense to use tagged mode.

Tagging circuits to VRFs (Juniper only)

This was my favourite part. The way to do this seems to be lt- devices, which means you need to set up the tunnel services PIC (if you have one - the MX80's have one built in).

chassis {
    fpc 0 {
        pic 0 {
            tunnel-services {
                bandwidth 1g;
            }
        }
    }
}

This next part took a while to figure out, but it totally works - you just need to make sure you match up the encapsulations and it all works fine.

Here's some config:

routing-instances {
    vrf {
        instance-type vrf;
        interface lt-0/0/10.1;
        route-distinguisher 1:2;
        vrf-target target:1:2;
        vrf-table-label;
    }
    vrf-43 {
        instance-type vrf;
        interface lt-0/0/10.3;
        route-distinguisher 1:42;
        vrf-target target:1:42;
        vrf-table-label;
    }
    vpls-40 {
        description vpls-40;
        instance-type vpls;
        vlan-id none;
        interface lt-0/0/10.2;           
        protocols {
            vpls {
                no-tunnel-services;
                vpls-id 40;
                mtu 9100;
                neighbor 10.0.0.2;
                neighbor 10.0.0.3;
            }
        }
    }
}
protocols {
    l2circuit {
        neighbor 10.0.0.2 {
            interface lt-0/0/10.4 {
                virtual-circuit-id 43;
                mtu 9100;
                encapsulation-type ethernet-vlan;
            }
        }
    }
}
interfaces {
    lt-0/0/10 {
        mtu 9192;
        unit 1 {
            encapsulation ethernet;
            peer-unit 2;
            family inet {
                mtu 9000;
                address 192.168.0.13/24;
            }
        }
        unit 2 {
            encapsulation ethernet-vpls;
            peer-unit 1;
        }
        unit 3 {
            encapsulation ethernet;
            peer-unit 4;
            family inet {
                mtu 9000;
                address 192.168.43.13/24;
            }
        }
        unit 4 {
            encapsulation ethernet-ccc;
            peer-unit 3;
        }
    }
}

This took about a day to get working, but it's totally simple once it matches up. The lt- devices stand in for crossover cables hanging out the back of your router, and the key is to just set the encapsulation types correctly. You can cheat a little with irb interfaces in VPLS routing instances, but this alters the route table directly on the chassis. Doing it this way means it's locked down to a VRF, and everything is a bit nicer.

End

I hope you've enjoyed this - let me know if there's anything I've missed or got wrong. The moral of the story here is - Juniper and Brocade can do VPLS and VLLs fine between each other - just watch out for the little quirks that would trip you up.

Monday, February 18, 2013

SamShares - Parsing financial data out of annual report PDFs

What's up

I've been doing a lot of financial research, and a big chunk of that is looking through financial reports, manually copying the fields for assets, liabilities, equity, EBIT etc. It's boring as hell, and takes a long time. Why can't we automate this?

Parsing PDFs

I started by forking PyPDF2 to give me better access to the underlying objects. It's a fairly good start for working with PDFs, but just blurts out (some of) the text in a random order, which isn't what I want. This led me down a bit of a rabbit hole and led to me downloading a copy of the PDF 1.7 reference and browsing through it, sections 5.2 and 5.3 in particular.

What's the plan?

  • Find the pages with assets/liabilities and income
  • Render them such that it's obvious where the columns and rows line up
  • Convert this to a spreadsheet
  • ???
  • PROFIT
For example, above is a screenshot from the annual report of New Zealand's largest NZX company, Fletcher Building. The PDF displays lovely rows and columns, but the data can't easily be accessed in this way. If we can parse the PDF and render all the text in place, we can then make fairly accurate guesses at which rows and columns the values fall into.

Quick primer to text in PDF

Here are some of the operators you'll find for manipulating text in a PDF

BT, ET - Start and end a text object. BT initialises the text matrix to the identity matrix - i.e. positioned at the origin of the page (the bottom left by default)
Td, TD, T* - Operators to move the cursor to the next line
Tm - Sets the text matrix. This is an affine transform, with 6 parameters - the first 4 matter for manipulating the text itself (scaling, warping, italics), and the last two essentially just set the start point for the text. This is enough for us to cheat and guess which way the text will go
Tc, Tw, Tf and lots more - Spacing and font settings

Tj, TJ - Display a text string - Tj does this simply, TJ has options after each character/substring for spacing information
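To give a feel for how these operators show up in practice, here's a very rough sketch that pulls Tm positions and Tj strings out of an already-decompressed content stream using a regex. It's nowhere near a real PDF parser - it ignores Td/TD/T*, TJ arrays, nested parentheses and font encodings - and extract_strings is a made-up name, not something from PyPDF2.

```python
import re

# Either six numbers followed by Tm, or a (string) followed by Tj.
PATTERN = re.compile(
    r"(?P<tm>(?:[-+]?\d*\.?\d+\s+){6})Tm"
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"
)


def extract_strings(stream):
    """Return [(x, y, text)] for each Tj string, positioned by the last Tm."""
    position = (0.0, 0.0)  # used if a string appears before any Tm
    found = []
    for match in PATTERN.finditer(stream):
        if match.group("tm") is not None:
            numbers = [float(n) for n in match.group("tm").split()]
            # The last two operands of Tm are the x and y translation.
            position = (numbers[4], numbers[5])
        else:
            found.append((position[0], position[1], match.group("text")))
    return found
```

Feeding it something like "BT 1 0 0 1 50 700 Tm (Total assets) Tj ET" gives back the string along with the point where it starts, which is all we need for lining things up later.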

Putting it all together

To parse a table out of a PDF, here's the rough idea:
  1. Locate all the strings on a page (BT/ET and TJ/Tj operators)
  2. Create a structure which ties the strings to locations (probably just Tm)
  3. Assign values row and column IDs
Once this is done, just check what is at the leftmost and topmost of each table, and use these as keys to the data. For the above image, the field "total assets" lined up with "June 2012" gives two results, so these just need to be referenced to the headers at the top, OR we can cheat and use the leftmost as this is generally the convention.
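Step 3 is the interesting one, so here's a minimal sketch of it, assuming we already have a list of (x, y, text) tuples for a page. It groups coordinates that sit within a small tolerance of each other and numbers the groups. The function names and the tolerance are my own guesses, and it only looks at where each string starts, so right-aligned numbers will need something smarter (e.g. clustering on the right-hand edge instead).

```python
def cluster(values, tolerance):
    """Map each coordinate to a group index; nearby values share an index."""
    mapping = {}
    index = -1
    previous = None
    for value in sorted(set(values)):
        # Start a new group whenever the gap to the last value is too big.
        if previous is None or value - previous > tolerance:
            index += 1
        mapping[value] = index
        previous = value
    return mapping


def assign_cells(strings, tolerance=2.0):
    """Turn [(x, y, text)] into {(row, column): text}, row 0 at the top."""
    cols = cluster([x for x, _, _ in strings], tolerance)
    # PDF y coordinates grow upwards, so negate them to count from the top.
    rows = cluster([-y for _, y, _ in strings], tolerance)
    return {(rows[-y], cols[x]): text for x, y, text in strings}
```

With the cells keyed like this, the leftmost column gives you the row labels and the top row gives you the column headers, so looking up "total assets" for a given year is just two dictionary scans.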

Next steps

Assuming I can make all this work, the data will then just be stored in a DB of some sort, keyed by year and company. Once this is automated enough to just pull PDFs out of NZX announcements, it'll be left in the background accumulating data, eventually building a corpus of financial data from NZX companies that can be used to make financial analysis much, much quicker and more versatile than it currently is.