
AbiWord Recovery

January 18, 2011

I was writing some current events articles for school and decided that OpenOffice was too complex. I was switching between the web browser and OpenOffice a lot. I liked how simple my web browser was and wondered why my word processor could not be like that. So I used another one I had heard of before, AbiWord.

It worked perfectly. I wrote up all the current events articles that I admittedly should have been writing over the past two months. Today, I double-clicked on my current events document and AbiWord politely said:

AbiWord cannot open /home/stephen/Documents/Government/FCP_Q2/current_events.abw. It appears to be an invalid document

As you can imagine, I panicked. I had worked on it for hours (a two-month project in one day) and all of it was gone. I tried double-clicking it about five more times. The exact same thing, over and over.

After I was convinced that my will alone was not going to fix the problem, I decided to open it in less. It was plain XML. And all my data, or at least a good portion of it, was still there.

So, I treated it like a standard data extraction job. I fired up ipython and loaded the broken file.

In [1]: data = open('current_events.abw', 'r').read()

I then imported my favorite XML library and tried to parse it.

In [2]: from lxml import etree

In [3]: doc = etree.fromstring(data)
XMLSyntaxError Traceback (most recent call last)

That explains why AbiWord couldn’t open it… Good thing lxml also has a less strict parser made for HTML.

In [4]: from lxml import html

In [5]: doc = html.fromstring(data)

In [6]: doc
Out[6]: <Element abiword at 93f847c>
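
The HTML parser is not the only forgiving option. lxml's XML parser also accepts recover=True, which asks it to keep whatever it can parse instead of raising. Here is a minimal sketch under some assumptions: it targets Python 3, and the BROKEN sample and the recover_text helper are made up for illustration and are not taken from the actual file.

```python
from lxml import etree

# Hypothetical stand-in for a damaged .abw file: it is cut off before the
# closing tags, so a strict XML parser rejects it.
BROKEN = b"""<?xml version="1.0" encoding="UTF-8"?>
<abiword>
<section>
<p>Publication: Wall Street Journal</p>
<p>Date: November 6, 2010
"""


def recover_text(raw):
    """Parse possibly malformed XML bytes and return the text that survives."""
    # recover=True makes the parser continue past errors instead of raising.
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(raw, parser)
    if root is None:
        # Nothing usable could be parsed.
        return ""
    # itertext() walks every text node under the root, in document order.
    return "".join(root.itertext())


print(recover_text(BROKEN))
```

On a real file, the bytes would come from something like open(path, 'rb').read(). Reading bytes rather than text matters on Python 3, because lxml refuses Unicode strings that carry an XML encoding declaration.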

OK, so apparently the HTML parser worked. But I still needed my data. From looking at the XML, it appears that all the data is in the text, and the formatting is all in the tags and attributes. So, let’s just strip all the XML and see what is left.

In [7]: print(''.join(doc.xpath('//text()')))

Publication: Wall Street Journal
Date: November 6, 2010
Author: Damian Paletta
Topic:GOP to Use Debt Cap to Push Spending Cuts

# More data was here

Perfect! A quick copy and paste to a sane word processor and my data was recovered. However, I still had to reformat it.
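
The same "ignore the tags, keep the text" idea can also be sketched without lxml, for a machine where it is not installed. This is an assumption-laden sketch, not what was run above: it uses Python 3's standard html.parser module, which is likewise tolerant of broken markup, and TextCollector and extract_text are names invented for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect character data and ignore every tag, well-formed or not."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def extract_text(markup):
    """Return the text between tags in possibly malformed markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at end of input
    return "".join(collector.chunks)


# Hypothetical truncated document, standing in for the damaged file.
sample = (
    '<?xml version="1.0"?>\n'
    "<abiword><section><p>Publication: Wall Street Journal</p>"
    "<p>Date: November 6, 2010"
)
print(extract_text(sample))
```

For a real file, the string passed to extract_text could come from open(path, encoding='utf-8', errors='replace').read(), so that a few damaged bytes do not stop the read.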

Moral? Use what you know if you plan on putting data into it that you don’t want to lose. Another moral? If you write software, make it use a sane format. That way, there is a chance of recovery.
