
AbiWord Recovery

I was writing some current events articles for school and decided that OpenOffice was too complex. I was switching between the web browser and OpenOffice a lot. I liked how simple my web browser was and wondered why my word processor could not be like that. So I tried another one I had heard of before, AbiWord.

It worked perfectly. I wrote up all the current events articles that I admittedly should have been doing over the past two months. Today, I double-clicked on my current events document and AbiWord politely said:

AbiWord cannot open /home/stephen/Documents/Government/FCP_Q2/current_events.abw. It appears to be an invalid document

As you can imagine, I was panicking. I had worked on it for hours (a two-month project in one day) and all of it was gone. I tried double-clicking it about five more times. The exact same thing happened every time.

After I was convinced that my will alone was not going to fix the problem, I decided to open it in less. It was plain XML, and all my data, or at least a good portion of it, was still there.

So, I treated it like a standard data extraction job. I fired up ipython and loaded the broken file.

In [1]: data = open('current_events.abw', 'r').read()

I then imported my favorite xml library and tried to parse it.


In [2]: from lxml import etree

In [3]: doc = etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError Traceback (most recent call last)

That explains why AbiWord couldn’t open it… Good thing lxml also has a less strict parser made for HTML.
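Before reaching for a more forgiving parser, it can be useful to find out where the file is actually broken. The following is a minimal sketch rather than part of the original session: it uses the standard library's xml.etree.ElementTree instead of lxml, and the `find_xml_error` helper and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def find_xml_error(text):
    """Return (line, column, message) for the first XML syntax error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple; line is 1-based, column 0-based.
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical damaged sample: the <a> element on line 2 is never closed.
broken = "<abiword>\n<p>Some text <a href='x'>link</p>\n</abiword>"
print(find_xml_error(broken))
```

With the damaged spot known, one option is to repair that line by hand in a text editor. The other is to let a lenient parser skip over the damage.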


In [4]: from lxml import html

In [5]: doc = html.fromstring(data)

In [6]: doc
Out[6]: <Element abiword at 93f847c>

Ok, so apparently that worked. But I still needed my data. From looking at the XML, it appears that all the data is in the text, while the formatting is all in the tags and attributes. So, let’s just strip all the XML and see what is left.


In [7]: print ''.join(doc.xpath('//text()'))
application/x-abiwordAbiWord

Publication: Wall Street Journal
Date: November 6, 2010
Author: Damian Paletta
Topic:GOP to Use Debt Cap to Push Spending Cuts

# More data was here

Perfect! A quick copy and paste to a sane word processor and my data was recovered. However, I still had to reformat it.
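Some of the reformatting might be avoided by keeping paragraph breaks during extraction. Here is a rough sketch of that idea, under two assumptions: that each paragraph in the .abw file lives in a `<p>` element, and that the standard library's html.parser (used here in place of lxml) is lenient enough for the damage in question. The `TextExtractor` class, the `extract_text` function, and the sample markup are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, starting a new line at each closing </p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Every run of text between tags ends up here.
        self.chunks.append(data)

    def handle_endtag(self, tag):
        # Assumption: AbiWord stores each paragraph in a <p> element.
        if tag == "p":
            self.chunks.append("\n")


def extract_text(markup):
    """Return the text content of possibly malformed markup."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical damaged sample: the <c> span is never closed.
sample = (
    "<abiword><section>"
    "<p>Publication: <c>Wall Street Journal</p>"
    "<p>Date: November 6, 2010</p>"
    "</section></abiword>"
)
print(extract_text(sample))
```

For a real file, something like `extract_text(open('current_events.abw', encoding='utf-8').read())` would be the starting point, assuming the file is UTF-8. Text from metadata elements comes through as well, so the first line or two of output may need trimming.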

Moral? Use what you know if you plan on putting data into it that you don’t want to lose. Another moral? If you write software, make it use a sane format. That way, there is a chance of recovery.

Categories: Uncategorized
  1. Richard
    August 15, 2011 at 10:35 pm

    Thanks for the tip… simply copying my mashed up AbiWord “.abw” to a “.xml” meant that I could use Internet Explorer to instantly tell me which line in the xml was damaged. It was down to a hyperlink that hadn’t been output correctly. A quick edit with notepad and AbiWord was able to read it again.

    However the lesson I’ve taken from this is not to trust AbiWord!

    If only it would preserve the input formatting correctly when importing M$ “.docx” files then I could continue with Word 2010, and only resort to AbiWord to do the automated conversion to PDF overnight, as the ‘--plugin AbiCommand’ works fantastically with genuine abi documents.

    Great advice though, your idea of parsing the XML to find where the ‘bug’ was got me straight onto the right track, cheers.

  2. nikita
    August 21, 2012 at 9:14 pm

    Hi!
    I was interested in your solution, but I am just an amateur. Nevertheless, I tried it.
    This is what happened to me:
    Python 3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> f=open('a.abw','r').read()
    >>> from lxml import etree
    >>> doc = etree.fromstring
    >>> from lxml import html
    >>> doc = html.fromstring
    >>> doc

    >>> print (''.join(doc.xpath('//text()'))
    application/x-abiwordAbiWord

    SyntaxError: invalid syntax

    I am very sad about it! This was my story, which I spent two months writing. Can you help me? Please.

    • August 21, 2012 at 10:08 pm

      In your code, you never passed the data from the file (f) to the html.fromstring function. Instead, you assigned the fromstring function itself to doc. Try the code below.


      f=open('a.abw','r').read()
      from lxml import html
      doc = html.fromstring(f)
      print(''.join(doc.xpath('//text()')))

  3. nikita
    August 23, 2012 at 4:28 pm

    Oops… thanks, but it doesn’t work for me. This is what happened:
    >>> f=open('a.abw','r').read()
    >>> from lxml import etree
    >>> doc=etree.fromstring(f)
    Traceback (most recent call last):
    File "", line 1, in
    doc=etree.fromstring(f)
    File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml\lxml.etree.c:54726)
    File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml\lxml.etree.c:82843)
    File "parser.pxi", line 1450, in lxml.etree._parseDoc (src/lxml\lxml.etree.c:81576)
    File "parser.pxi", line 925, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml\lxml.etree.c:78000)
    File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml\lxml.etree.c:74567)
    File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml\lxml.etree.c:75458)
    File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74958)
    File "", line None
    lxml.etree.XMLSyntaxError:
    >>> from lxml import html
    >>> doc=html.fromstring(f)
    Traceback (most recent call last):
    File "", line 1, in
    doc=html.fromstring(f)
    File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 634, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
    File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
    File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml\lxml.etree.c:54726)
    File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml\lxml.etree.c:82843)
    File "parser.pxi", line 1450, in lxml.etree._parseDoc (src/lxml\lxml.etree.c:81576)
    File "parser.pxi", line 925, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml\lxml.etree.c:78000)
    File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml\lxml.etree.c:74567)
    File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml\lxml.etree.c:75458)
    File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74958)
    File "", line None
    lxml.etree.XMLSyntaxError:
    I don’t want to be annoying, but can you tell me what I should do? Is there some chance that I can save my file?
