I was stuck in a class on campus and wanted to get on IRC (yes, it was a boring class). My school blocks port 7000. This is normal because IRC is used by botnets for communication. So, I did what I normally do in these situations. I attempted to ssh to a server at work and set up port forwarding.
ssh: connect to host myhost port 22: No route to host
This ticked me off. Who blocks port 22?! Before this, getting on IRC was not really important. But now it had gotten personal. God knows I will need to ssh from school at some point for work. I needed to find a way to fix this.
This is where using EC2 is awesome. The Amazon Linux and Ubuntu AMIs (server images) allow you to pass a script to run on first boot. All you need to do is put a shebang at the top of the user data that is given to Amazon when you start the instance.
#!/bin/bash
echo >> /etc/ssh/sshd_config
echo "Port 443" >> /etc/ssh/sshd_config
That was all it needed. The first echo is probably not necessary, but it ensured the file ended with a newline. After a reboot (protip: /etc/init.d/sshd restart), I was able to use ssh.
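For anyone who would rather script the same change in Python, here is a minimal sketch of the two echo lines. The helper name `append_port` and the temporary file in the demo are made up for illustration; on a real box the path would be /etc/ssh/sshd_config and you would need root. Unlike the first echo, it only adds a newline when the file does not already end with one.

```python
import os
import tempfile


def append_port(config_path, port):
    """Append a 'Port N' line, first making sure the file ends with a newline.

    Hypothetical helper: mirrors the two echo commands, nothing more.
    """
    with open(config_path, "r+") as f:
        content = f.read()
        f.seek(0, os.SEEK_END)  # make sure we write at the very end
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write("Port %d\n" % port)


if __name__ == "__main__":
    # Demo on a throwaway file instead of the real sshd_config.
    fd, path = tempfile.mkstemp(suffix=".conf")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("PermitRootLogin no")  # deliberately no trailing newline
    append_port(path, 443)
    with open(path) as f:
        print(f.read())
    os.remove(path)
```

sshd still has to be restarted afterwards for the new port to take effect.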
Thank god for the universal firewall bypass ports: 80 and 443!
I was doing some current events articles for school and I decided that OpenOffice was too complex. I was switching between the web-browser and OpenOffice a lot. I liked how simple my web-browser was and questioned why my word processor could not be like that. So, I used another one I had heard of before, AbiWord.
It worked perfectly. I wrote up all the current events articles that I admittedly should have been writing over the past two months, and it just worked. Today, I double clicked on my current events document and AbiWord politely said:
AbiWord cannot open /home/stephen/Documents/Government/FCP_Q2/current_events.abw. It appears to be an invalid document
As you can imagine, I was panicking. I had worked for hours (a two-month project in one day) on it and all of it was gone. I tried double clicking it about five more times. The exact same thing over and over.
After I was convinced that my will alone was not going to fix the problem, I decided to open it in less. It was plain xml. And all my data, or at least a good portion of it, was still there.
So, I treated it like a standard data extraction job. I fired up ipython and loaded the broken file.
In : data = open('current_events.abw', 'r').read()
I then imported my favorite xml library and tried to parse it.
In : from lxml import etree
In : doc = etree.fromstring(data)
XMLSyntaxError Traceback (most recent call last)
That explains why AbiWord couldn’t open it… Good thing lxml also has a less strict parser made for html.
In : from lxml import html
In : doc = html.fromstring(data)
In : doc
Out: <Element abiword at 93f847c>
Ok, so apparently that worked. But I still needed my data. From looking at the xml, it appeared that all the data is the text and the formatting is all the tags and attributes. So, let's just strip all the xml and see what is left.
In : print ''.join(doc.xpath('//text()'))
Publication: Wall Street Journal
Date: November 6, 2010
Author: Damian Paletta
Topic:GOP to Use Debt Cap to Push Spending Cuts
# More data was here
Perfect! A quick copy and paste to a sane word processor and my data was recovered. However, I still had to reformat it.
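If lxml is not installed, roughly the same trick can be sketched with only the Python 3 standard library. This is a swapped-in technique, not what I ran above: it uses the forgiving `html.parser.HTMLParser` to collect every text node and ignore the tags. The class name `TextCollector`, the function `recover_text`, and the sample markup (including the `section` and `p` tag names) are made up for illustration and are not taken from the real file.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes and ignore all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def recover_text(markup):
    """Return the concatenated text of possibly broken XML/HTML markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.chunks)


if __name__ == "__main__":
    # Made-up sample: the closing tags are missing, so a strict
    # XML parser would reject it.
    broken = (
        "<abiword><section>"
        "<p>Publication: Example Paper</p>\n"
        "<p>Topic: Example Topic"
    )
    print(recover_text(broken))
```

As with the xpath version, all formatting is thrown away, so the recovered text still has to be reformatted by hand.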
Moral? Use what you know if you plan on putting data into it that you don’t want to lose. Another moral? If you write software, make it use a sane format. That way, there is a chance of recovery.