Steve's picture

writing a crawler - relative to absolute urls

Ok, as I write my crawler, I've noticed that there are many pieces to the whole crawler puzzle.  I've got the part that pulls pages off the internet as fast as possible (that part works great), but that part is the easy part.  The hard part is doing something useful with the data once I've fetched it.  The first step is pulling URLs from the html for future crawling.  Storing/de-duping/… of urls is a whole other problem with doing a huge crawl, so I'll cover that later.

When I was working with the http://archiveteam.org guys pulling down geocities urls, I used my old crawler.  It just pulled urls and didn't think very hard about things.  However, after the fact, I realized that crawlers need to convert relative urls (e.g. "blah.html", "../blah.html", "../dir2/blah.html", "/blah.html") to absolute urls (e.g. http://blah.com/blah.html).  Many crawlers do this internally, but the code to do it is not easy to find on the internet, and I think that I know why.

The reason there isn't just a regex or some code out there to convert relative urls to absolute urls is that urls are not always consistent.  Computers don't write web pages most of the time, people do, so there are mistakes in them.

If a file (blah.html) is in a directory called "a", the absolute url would be http://blah.com/a/blah.html.  However, if blah.html has a link in it to "../../blah.html", that link climbs above the top of the site, so it's not a valid url and it's catchable.  "../blah.html" is fine.  "../b/blah.html" is fine, "c/blah.html" is fine.  Get it?
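Here's a minimal sketch of those cases in Python, using the standard library's urljoin.  The function names to_absolute and climbs_above_root are mine, made up for illustration, and blah.com is just the example host from above.  One thing to note: urljoin follows RFC 3986, which quietly drops extra ".." segments at the root instead of complaining, so catching a link like "../../blah.html" takes a separate check.

```python
from urllib.parse import urljoin, urlsplit

def to_absolute(base_url, link):
    """Resolve a link found on base_url to an absolute URL (RFC 3986 rules)."""
    return urljoin(base_url, link)

def climbs_above_root(base_url, link):
    """Return True if a relative link has more '..' segments than the
    base URL has directories (hypothetical helper, not a library call)."""
    if urlsplit(link).scheme or link.startswith("/"):
        return False  # absolute or root-relative links can't climb too far
    # Count the directories in the base path; the last segment is the file.
    depth = len([s for s in urlsplit(base_url).path.split("/")[:-1] if s])
    for segment in urlsplit(link).path.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True
        elif segment not in ("", "."):
            depth += 1
    return False

base = "http://blah.com/a/blah.html"
for link in ["blah.html", "../blah.html", "../b/blah.html", "c/blah.html",
             "/blah.html", "../../blah.html"]:
    flag = " (climbs above the site root)" if climbs_above_root(base, link) else ""
    print(link, "->", to_absolute(base, link) + flag)
```

A crawler could log or skip the flagged links rather than trusting whatever the resolver hands back.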

But what happens if an original url is http://blah.com/a (no trailing slash).  Is the crawler pulling http://blah.com/a ("a" is the name of the file) or http://blah.com/a/index.html or http://blah.com/a/index.php or http://blah.com/a/index.htm ???  Also, if there are relative urls in that file, like "blah.html", does that mean http://blah.com/blah.html or http://blah.com/a/blah.html ???  The answer is that you have to crawl both.  The web server will serve up content in either case, and not all people put trailing slashes in their urls if they mean a directory, so you can't assume that you know which one is best, so you need to crawl both cases.  That's why there's no code out there to figure it out.  There's no one, good answer.
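The "crawl both" idea can be sketched like this, again in Python.  This is only one way to do it, under the assumption that a url with no trailing slash might be either a file or a directory; candidate_urls is a made-up name, not something from a library.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def candidate_urls(page_url, link):
    """Return every absolute URL a link on page_url could mean when we
    can't tell whether page_url names a file or a directory."""
    # Reading 1: the last path segment is a file (what RFC 3986 assumes).
    candidates = [urljoin(page_url, link)]
    parts = urlsplit(page_url)
    if parts.path and not parts.path.endswith("/"):
        # Reading 2: the last path segment is really a directory.
        dir_base = urlunsplit(parts._replace(path=parts.path + "/",
                                             query="", fragment=""))
        as_dir = urljoin(dir_base, link)
        if as_dir not in candidates:
            candidates.append(as_dir)
    return candidates

print(candidate_urls("http://blah.com/a", "blah.html"))
print(candidate_urls("http://blah.com/a/", "blah.html"))
```

When the page url already ends in a slash, or the link is absolute, both readings agree and only one url comes back, so the extra crawling only happens in the ambiguous case.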


How just one popular site could hack anyone's account(s) on the internet

Call me paranoid, but ...

Ok, take facebook or Google or twitter or some other popular website on the internet. They all require accounts using an id (usually your email address) and a password to get access to, right? So what happens when you forget your password on a certain website? You try several of your 'standard' passwords that you use for those sorts of websites, or you use your password that you use everywhere, or a variation of it until you exhaust all of your known variations, then you hit the "I forgot my password" link to have them email your password, right?

Well, let's say (just for argument's sake) that one of these sites monitors all of the password variations that you type into the password box when you get locked out of your account. They could build a list of all of your possible passwords for use later. They could have your id, email address and list of possible passwords to try to get into almost any other website! The one website that comes to mind right off the top of my head is Google. They love to mine data and they love to get their greedy little hands on as much of it as they can. We've all forgotten our passwords to a website at one time or another. Google probably has a list of possible passwords that any one user uses on a variety of websites. Plus, through gmail, they could just snoop your email to see what your passwords are anyway, since just about every site you sign up for emails you your password, not to mention the data that they could get access to and index if they could log in as people on websites... Let's forget about Google for now, since they're the obvious company that already knows everything about you that they need to know, and if you're an avid gmail, gdocs, gchat user, they could probably hack into any of your accounts if they so desired. Let's focus on a different major website instead. I could pick any, but facebook is super-huge, so let's pick on them.

Facebook knows your location, education, friends, email address(es), hobbies, and possibly all of your online passwords. With this information, facebook could blackmail people, get into their bank accounts, their work accounts, ... Pretty much hack into any account on the internet that requires an id/pw combination and is protected by some sort of personal information to reset your password. Social sites know so much about you - the name of your dog, where you went to school, where you went on your honeymoon, ... Pretty much any question on that "I forgot my password" list of questions that you select a couple from to use as your password reset Q/A security step can be found on any social site today.

I love google and google apps. It would suck to go cold-google-turkey and use desktop apps and not have the flexibility of using all of the Google tools that I have come to rely on daily. I use Google Mail, Calendar, Reader, Docs, Voice, a personalized google search engine, and I'm sure that I'll be using Google Wave when it's available too. I'm a google junkie and I also post stuff to facebook, twitter and other sites semi-regularly. I'm hooked-in and I don't want to un-hook if at all possible. So, what can I do about it?

The obvious choice: multiple password types. I have a Google-only password. I have a work-only password. I have an ebay/paypal-only password. I have a bank/finance-only password. I have a personal-machines-only password. I have several passwords and password variations for my online accounts that I don't care about. I need to add another one, a social-only password. Also, none of these passwords should be related or similar to each other in any way. Different phrases, different subjects, different number combinations.

This gives me 6 semi-secure passwords, and one 'pool' of crap passwords for sites I don't care about. It's hard to keep these passwords all sorted out, but I make the changes one at a time, and if I'm on a bank website, I know not to use any of my other passwords or password variations. Also, social sites don't get to even snoop in on my bank-only password attempts since I won't even be trying them in the password field at all if/when I forget my password.

Writing down passwords is a general security no-no. There are password-remembering tools that encrypt your passwords, and Firefox and IE try to remember IDs and passwords for you as best they can. There are password syncing utilities out there to keep your browsers in sync when you type something in at home and go to work and want the browser to remember it for you. Those are all fine, and I use them too, mostly for convenience's sake, but I don't inherently trust them. Your browser can be used when you step away at work to use the restroom. Your encrypted password app can be lost if your OS crashes. Storing them online is just a bad idea.

When signing up for a bank account, or a 'serious' online service (your CC company, paypal, ...), don't use any of your online email accounts as an address. I'm lucky and I have my own email server, so I use it as my email address for those more serious accounts. This way, if I do have to hit the "I forgot my password" button, the new password gets sent to a known-secure email address that only I have access to.

So how can you remember your passwords? I tend to think that 'security through obscurity' is one of the best ways to go. If I build a website of my own and I'm worried about someone hacking it, I will write a custom website. I won't use drupal or wordpress or phpBB - those are sure to get hacked just a couple of months after I install them. The same thing goes for keeping your passwords safe. If you write them down, put them somewhere safe, I mean *REALLY* safe, like a safety deposit box or something - now that's a huge inconvenience if you just need to log into your bank and see your statement or something, but you see where I'm going. Generate a procedure that's not online, that is secure and has some sort of backup/redundancy to it. Then you'll be able to say without a doubt that you have a safe location for all of your passwords that will still be available if your home machine dies, if some sort of disaster happens, etc ...

Ok, enough ranting for today. Let me know below if you have any other ideas.


The way to do url-shortening right

http://www.adjix.com has figured out the solution to bit-rot with url-shorteners. They have all of the standard features of the normal url-shorteners, but they do one interesting thing. They allow the user to back up their shortened url links to an amazon S3 bucket. What does this do? It allows you to OWN your shortened links! Yea, I set up a CNAME of url.badcheese.com to point to my amazon S3 bucket, gave adjix permission to write to my bucket and now when I make a shortened url, it fetches an actual file off of S3 and does the redirect through javascript. Why would someone like to own their shortened urls? Well, bit-rot (http://en.wikipedia.org/wiki/Bit_rot) is a problem with url-shorteners. If a url-shortener goes away, your tweets, IMs and all of your content lose a LOT of meaning. Take for example, the tweet: "Check this out: http://tinyurl.com/xxxx". Meaningless without the following page, right? If I own my own shortened urls, then I can take those shortened urls and move them around from machine to machine and do whatever I want with them. I know that they'll always be there because I own them, and if I want to remove one, I just nuke it. Simple enough!
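To give an idea of what a "file that does the redirect through javascript" might look like, here's a rough Python sketch that builds one.  I don't know the actual format of the files adjix writes to S3, so the page layout, the redirect_page name and the example.com url are all my own assumptions; uploading the result to a bucket isn't shown.

```python
import html
import json

def redirect_page(target_url):
    """Build a tiny HTML page that redirects to target_url with JavaScript.
    (Hypothetical format, not the one adjix actually uses.)"""
    # json.dumps yields a quoted JS string literal; escaping "</" keeps a
    # hostile URL from closing the <script> tag early.
    js_target = json.dumps(target_url).replace("</", "<\\/")
    # html.escape makes the URL safe inside the fallback link's href.
    href = html.escape(target_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f"<script>window.location.replace({js_target});</script>\n"
        "</head><body>\n"
        f'<a href="{href}">{href}</a>\n'
        "</body></html>\n"
    )

page = redirect_page("http://example.com/some/long/page?x=1&y=2")
print(page)
```

Since each shortened url is just a static file like this, moving your links to another machine is a matter of copying files.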

Anyway, hats off to the guys at http://www.adjix.com , also - I sent in a bug report last night at 11:30pm MT, and they answered and fixed the bug by the time I got up in the morning, so they're midnight hacker types and could really be onto something! I'm a big fan already.

Google Stock Chart on Google Finance: http://url.badcheese.com/57m5


Google Voice needs a little work for the transcription service. :)

Here's the transcription text:

Hello in. Thank you for trying. I respect Radiance patented information system. Hi Miss provide crappy communications any device for Texas beach or voice recording. I received the state of the our solution with multiple features including dynamic grouping survey and conference calling for more information about I read, please contact us at (281) 263-6300 or visit us on the website at W W W dot used Iris, dot, com This message was brought you by tyrant press 1 to repeat the message, Hello in. Thank you for trying. I respect Radiance patented information system. Hi Miss provide crappy communications any device for Texas beach or voice recording. I received the state of the our solution with multiple features including dynamic grouping survey and conference calling for more information about I read, please contact us at (281) 263-6300 or visit us on the website at W W W dot you. Cyrus dot com.


TechStars 2009 demo day companies

http://Reteltechnologies.com - mostly just uses something like amazon's mechanical turk to have people watch surveillance videos and flag when something happens. Good for guys who own a 7-11 store, but I wouldn't use it.

http://everlater.com - travel site that combines many social sites to tell a story of your trip (the guys were very good at selling the product, but I don't think that I'd use it)

http://timzon.com - was just a customer support tool, I wouldn't use it.

http://TakePublishing.com - itunes for comic books. I'm not a huge comic book fan, but it has potential.

http://NextBigSound.com - for big record labels or bands to monitor which social site they need to focus their marketing on.

http://vanillaforums.com - I would use it! Great forum software - used by tons of big-boys.

http://sendgrid.com - Better 'transactional' email, like the email that you get from facebook and twitter (aka, Mr Jones is now following you on twitter). I don't need that.

http://SpryPlanner.com - Manager software for high-level overview of a software development package/group. Integrates into nagios, github, etc ... Not something I need.

http://Mailana.com - the least impressive tool, like linked-in, but for the major social sites. Combines all of your friends and makes meta-friend circles out of all of your 'best' friends and allows for you to search within those circles.

http://Rezora.com - The real estate email marketing tool. Impressive tool, but I don't have a need for it.


Finally a video that explains some of the things that I do in a language that even my mom can understand

Yahoo engineers explain their new homepage design/architecture with closed-captioning for the geek-impaired.

http://cosmos.bcst.yahoo.com/up/player/popup/?cl=14758196


I spend too much time online to deal with crappy JavaScript

I must spend 10-12 hours on a computer every day, most of it online in one form or another.

I've got a quad-core computer with 4GB of ram, a $300 Nvidia video card, the latest browser (Firefox 3.5) which has a turbo-speed JavaScript engine in it, but for some reason, my browser seems to hang every few minutes while waiting for something to happen and crappy resolution internet video is choppy and difficult to bear.  I must waste 30-60 minutes a day just waiting for my browser to wake up from doing something behind the scenes.  To put it bluntly, this pisses me off.

I'm an old-school Unix guy.  I use pine for my email reader.  I use xterms instead of the nicer gnu terminals, I turn off all of the fancy special effects in the window manager, …  I prefer a minimalistic, simple, not overly complicated interface to do my work.  When I put together the network and supporting infrastructure for the company that I work for, I chose to keep it as simple as possible so that anyone could jump in and understand things if that needed to happen.  I do these things for many reasons, but the main reason is that complexity costs resources.  Simple = fast and manageable.  With today's computers, everything should be lightning-fast and waiting should be a thing of the past, but it isn't.  It seems that I'm waiting just as much nowadays as I was with my 4.77 MHz IBM PC XT back in the early 80's.  Sure, things have changed, but today's software seems like it's written just fast enough to stay within human tolerance for slowness.  People optimize just enough so that it's acceptably fast, and no more.  To put it bluntly, this also pisses me off.

I want to build a website that competes with the big-boys and has all of the same bells and whistles as the other guys.  It will be pretty and serve a function, but it will have one big difference.  It'll be lightning-fast!  It won't hang the user's browser doing things that could be done differently.  It won't thrash the user's swap just to go to the second page of search results.  It will transfer just enough data to get the job done, no more.  The JavaScript that it uses will be clean and fast and will not cause the browser to thrash or swap.  The infrastructure on the server-side will not use the latest/greatest technology just because it exists and might make things 2% easier to write or maintain.  I'll write the backend code as optimized as I possibly can.  I'll write C code when it needs to be tight, fast and small.  I'll use IP addresses instead of hostnames to avoid the 100ms DNS query.  I'll use a BerkeleyDB instead of SQL wherever it makes sense.  I'll send compressed, pipe-delimited text as the protocol instead of bloated XML.  Sure, it'll be harder to write.  Sure, it'll be a little less flexible, but it'll be easy to maintain, easy to understand and will take fewer resources and run fast on slower (or practically any) hardware.  The experience will be a happy one, and people like simple, happy, fast experiences.
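As a rough illustration of the "compressed, pipe-delimited text" idea, here's a small Python sketch (the real thing would probably be C, but this shows the shape of it).  The function names and the record layout are made up for the example, and it assumes that no field ever contains a pipe or a newline.

```python
import zlib

def encode_records(records):
    """Pack a list of records (lists of strings) into compressed,
    pipe-delimited text: one record per line, fields joined by '|'.
    Assumes no field contains '|' or a newline."""
    text = "\n".join("|".join(fields) for fields in records)
    return zlib.compress(text.encode("utf-8"))

def decode_records(blob):
    """Reverse encode_records: decompress, then split lines and fields."""
    text = zlib.decompress(blob).decode("utf-8")
    if not text:
        return []
    return [line.split("|") for line in text.split("\n")]

# Hypothetical records: id, user name, domain.
records = [["42", "steve", "badcheese.com"], ["43", "mom", "example.com"]]
blob = encode_records(records)
print(len(blob), "bytes on the wire")
print(decode_records(blob))
```

The trade-off is the one described above: there's no schema and no escaping, so both ends have to agree on the field order ahead of time.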

I use Perl for some tasks, but if I write a script that I'll call thousands or millions of times, I start to see that Perl's startup time is a large part of the cost of my entire procedure.  Re-writing the simple task in C cuts down on the startup time (C is doing real work almost instantly after startup while Perl/python/java are still loading, parsing, allocating, etc …) and reduces the overall run time by a very large margin.  This is a perfect example of programming ease versus optimal code.  I can whip out a Perl script in a minute to do something, but if I use it a lot, Perl is not the optimal choice.

Another example is desktop apps versus browser apps.  I prefer desktop apps to the more modern browser-based apps of today because of their speed and flexibility.  Sure, browser apps are great for some applications, but anything that requires any speed or major resources will still work WAY better on a user's desktop than in any browser-based application.  Let's see a browser-based AutoCAD or 3dsmax application – forget it.  It'll never happen.

I'm going to stick with the good old Unix philosophy – do one thing and do it well.  Trying to be the end-all, be-all application that is everything for everyone is not going to happen or make anyone very happy … ever … even in another 20 years when computers become even more unbelievably powerful.


Pronto's micro sites go live!


Pronto's new micro sites officially went live last week.  Let's hear it for the Pronto.com team!  :)
