Handling CSV files from financial institutions

John Lewis recently revamped the web site for its credit card (the Partnership Card). This caused much angst within the middle classes. I didn't encounter any of the problems mentioned but was bothered by something else entirely.

On the original web site, the only way to obtain the credit card transactions for importing into Microsoft Money was to screen-scrape. Year's ago, I'd developed a set of tools built on shell and awk to automate the conversion of such finance data to QIF files for import into Money. However, the new website offered a CSV file download capability. Figuring this might make obtaining the transactions easier, I downloaded a sample and took a look at the file in emacs. It looked like this:

"^@M^@R^@ ^@M^@A^@R^@C^@ ^@T^@I^@L^@L^@S^@O^@N^@
^@T^@r^@a^@n^@s^@a^@c^@t^@i^@o ^@n^@s^@ ^@e^@x^@p^@o^@r^@t^@e^@d^@
^@o^@n^@ ^@2^@0^@1^@6^@-^@1^@0^@-^@2^@9^@ ^@1^@6^@:^@5^@1^@"^@^M^@

A hex dump made things a bit clearer...

00000000  22004D00 52002000 4D004100 52004B00  *".M.R. .M.A.R.C.*
00000016  20005700 49004C00 4C005300 4F004E00  * .T.I.L.L.S.O.N.*
00000032  20005400 72006100 6E007300 61006300  * .T.r.a.n.s.a.c.*
00000048  74006900 6F006E00 73002000 65007800  *t.i.o.n.s. .e.x.*
00000064  70006F00 72007400 65006400 20006F00  *p.o.r.t.e.d. .o.*
00000080  6E002000 32003000 31003600 2D003100  *n. .2.0.1.6.-.1.*

While Excel was quite happy with the file format, clearly shell and awk weren't going to cut it anymore, and awk can't handle csv files very well anyway. I needed something that could process what looked like a UTF-16 csv files properly. Time to re-write the tools in python.

The mkqif.py script is the result of that re-write. Now, handling John Lewis data is easier, since no screen-scraping is involved.

I did ask John Lewis (via the Message facility) why they chose UTF-16. The three, mostly duplicated responses, suggested my dissatisfaction should be raised with the Financial Ombudsman. Who knew they offered IT support?

Update: 2017-02-04

The character encoding of the CSV file changed around this time. UTF-16 was replaced by the ISO8859-1 (latin1) character set. Another header row was also added. The current version of mkqif.py handles this new format.

Update: 2019-02-10

Another rewrite, this time to really use classes properly. FIBase is renamed CVSFormat (becuase that's what it really represents) and the institutions are instances of CVSFormat, using constructor arguments to reflect the differences in formatting.

Update: 2022-06-24

Source code for mkqif.py is now available on Github.