Steve's picture

A little experiment

Ok. I've started a little experiment. Knowing what I know about SEO (not a ton, but some), I'm wondering if I can capture some kind of reasonably decent, SEO-driven placement in Google's results. Here's the plan: I capture Google "trends" terms every hour. Then I pipe those terms through all kinds of news sites and extract the top hits via Yahoo Pipes. I grab the RSS results from the pipe and create a web page using those results. So, I have popular queries and good-ish content, and I do all of the right things (as far as I know) to get the pages indexed as fast as possible (I'm still not great at that).

The site is chock full of content. The sitemap is updated hourly and downloaded by Google every 24 hours (I'm trying to get faster fetching than that), and Google's crawler is hitting my site every 2 seconds almost all day long, fetching new links and pages. Pages have titles and keywords and should be very SEO-friendly overall.

It's been about 2 weeks since I started the project, and it is really interesting watching Google index my pages and rank them. I think that 2 weeks is not enough time to say one way or the other how well the experiment is going, since the pages don't actually enter Google's real index for about 4 weeks. I'll post more when things get interesting. :) The site: http://xis.cc
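
For anyone curious what the hourly sitemap step could look like, here is a minimal sketch in Python. It is not the code running on the site: the function name `build_sitemap`, the `(url, datetime)` input shape, and the "hourly" change frequency are all assumptions made for illustration. Only the sitemaps.org element names (`urlset`, `url`, `loc`, `lastmod`, `changefreq`) come from the sitemap protocol itself.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    last_modified is assumed to be a timezone-aware datetime.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # Normalize to UTC so the trailing "Z" is accurate.
        stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        ET.SubElement(entry, "lastmod").text = stamp
        # Assumption: pages are regenerated hourly, matching the post.
        ET.SubElement(entry, "changefreq").text = "hourly"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical page URL, used only to show the output shape.
    now = datetime.now(timezone.utc)
    print(build_sitemap([("http://xis.cc/example-page", now)]))
```

Regenerating the file from a cron job every hour and keeping `lastmod` accurate is about all a site can do here; how often a crawler actually re-downloads the sitemap is up to the search engine.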

So, with Woz's legions of geek/hacker fans, how is he not going to automatically win Dancing With The Stars via a python script?

Ok, so I know that Steve Wozniak (one of my childhood heroes) has legions of Apple fanboys and geek hackers under his belt that he can call to action at a moment's notice, right? Voting can be done via the Internet with any email address (it doesn't even require registration) for a couple of hours every night. My question is, where's the hacker that posts the automated Python script to get a fake email address, auto-register at abc.com and mass-vote for Woz on Dancing With The Stars? I can't believe that this hasn't already been done. I myself tried to put something together in Perl quickly on Monday night, but I didn't have time (family stuff interrupted). So, hacker nerds … Where is it? I'll be the first one to start it up and get it running on my systems, so let's get it going! :)

Denver DIA wireless is free, but completely broken

So, I've got 2 hours to wait before my flight leaves for JFK today, so I open up my laptop to change my email to vacation mode and read some news, etc … Linux connects to the WAP, but fails to get any DNS info from the server, so I reboot into WinXP. XP detects the WAP with "excellent" connection status, but getting the 'portal' server to actually serve any information is nearly impossible. The server serves up a 30-second commercial before it allows the user to get access to the internet, but I'm guessing from the speed of things (it's now about 45 minutes since I started trying to get access, which is when I decided to whip out my offline BLOG writer application) that the server is busy serving the ad to so many people that it is completely unresponsive. The actual wireless signal strength is great; it's just that whoever they chose as the ISP (http://freefinet.rtr ???) does a completely crappy job at actually doing the connection redirection. Blech!

Sure, it's free wireless that you have to hit "accept" for twice, and watch a 30-second video that never shows up, but come on!!! Ever heard about caching? Proxying? Load-balancing? Or how about this: if a connection completely fails after about 30 minutes of attempts, how about just lifting the proxying crap and allowing normal internet in case your stupid server can't handle the load??? In my book, DIA wireless is completely useless to me. This is the fault of both the ISP and DIA. DIA should give the ISP 2 days to fix the problem or drop them like a hot rock. I'm sure that there are thousands of ISPs in Denver willing to provide the access hardware for one free ad.

Another reason that relying on "the cloud" is a bad idea.

FYI: I was able to get access to the internet after 63 minutes of attempting to get through the ISP redirection/bullshit.

Dell's new packaging

Just got in some rails from Dell. Looks like they've also changed their packaging standards.

- Steve

mysqlgame is my kind of game!

Ok, no fancy graphics.  No sound effects.  No FPS actually.  :)  Check out mysqlgame here: http://mysqlgame.appspot.com

Had to post this.

Sarah was flashing some smiles last night during dinner, so I had to put this picture out there.

I’m so proud.  :)

Gonna play with Nutch tonight

Gonna play with Nutch tonight as a possible replacement for my own personal web crawler. Nutch is a Java-based web crawler that we may use for our large crawling process, but not for our ripping/indexing layer.

Followup:

Why I'm not going to be going with Nutch:

fetching http://www.everydaybirthday.com/
fetching http://welcome.hp.com/gms/gr/el/sz3/smb/notebooks_tabletpcs.html
fetching http://boomp3.com/listen/fbnoc45_p/am-gold-1970-04-your-song-elton-john
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

HOWTO: diffing two huge files in Linux

Situation: I have two computers with a large number of files on them (approximately 250 million files on each machine). I need to sync them up, and rsync is not an option because it takes way too long. So, I need to 'diff' the file lists of the two machines.
I ran 'find' on both machines and redirected the output to a file. These files turned out to be about 15GB each, which is too large to just 'diff', because diff wants to read everything into memory:

[root@fs105 tmp]# diff imagelist_image01.txt imagelist.txt
diff: memory exhausted

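Solution: sort the files manually first, then use the 'comm' command to find the differences. 'sort' spills to temporary files (-T .) once it fills its buffer (-S 2G), and 'comm' only needs to look at one line from each sorted file at a time, so neither step runs out of memory. 'comm -3' drops the lines common to both files and prints only the lines unique to one side.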

[root@fs105 tmp]# ls -la
total 27935834
drwxr-xr-x 2 root root 120 Oct 6 10:18 .
drwxr-xr-x 8 root root 192 Oct 3 13:56 ..
-rw-r--r-- 1 root root 13859131915 Oct 6 10:13 imagelist_image01.txt
-rw-r--r-- 1 root root 14719246513 Oct 3 14:02 imagelist.txt
[root@fs105 tmp]# sort -S 2G -T . imagelist.txt > imagelist_image02_sorted.txt ; sort -S 2G -T . imagelist_image01.txt > imagelist_image01_sorted.txt
[root@fs105 tmp]# comm -3 imagelist_image01_sorted.txt imagelist_image02_sorted.txt > diff.txt
[root@fs105 tmp]# ls -lah ; wc -l diff.txt
total 55G
drwxr-xr-x 2 root root 240 Oct 6 14:16 .
drwxr-xr-x 8 root root 192 Oct 3 13:56 ..
-rw-r--r-- 1 root root 895M Oct 6 15:09 diff.txt
-rw-r--r-- 1 root root 13G Oct 6 14:11 imagelist_image01_sorted.txt
-rw-r--r-- 1 root root 13G Oct 6 10:13 imagelist_image01.txt
-rw-r--r-- 1 root root 14G Oct 6 12:25 imagelist_image02_sorted.txt
-rw-r--r-- 1 root root 14G Oct 3 14:02 imagelist.txt
17487092 diff.txt
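
To show why the sort-then-comm approach gets away with so little memory, here is a rough Python sketch of the same idea. This is not how 'comm' is actually implemented, just the same merge-style walk over two sorted inputs; the function name `comm3` and the ("left"/"right", line) output shape are my own invention. It assumes both inputs are already sorted in the order Python compares strings (plain byte order, e.g. what `LC_ALL=C sort` produces).

```python
def comm3(left_lines, right_lines):
    """Yield (side, line) for lines found in only one of two sorted inputs.

    Mirrors what `comm -3` reports: lines common to both inputs are skipped.
    Only one line per input is held in memory at a time.
    """
    left = iter(left_lines)
    right = iter(right_lines)
    a = next(left, None)
    b = next(right, None)
    while a is not None and b is not None:
        if a == b:
            # Present on both sides: skip it and advance both inputs.
            a = next(left, None)
            b = next(right, None)
        elif a < b:
            # `a` sorts before anything left on the right, so it is unique.
            yield ("left", a)
            a = next(left, None)
        else:
            yield ("right", b)
            b = next(right, None)
    # Whatever remains on either side has no counterpart.
    while a is not None:
        yield ("left", a)
        a = next(left, None)
    while b is not None:
        yield ("right", b)
        b = next(right, None)


def read_lines(path):
    """Stream a file line by line with the trailing newline removed."""
    with open(path, encoding="utf-8", errors="surrogateescape") as handle:
        for line in handle:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Tiny in-memory demo; for real files, pass read_lines(path) instead.
    for side, line in comm3(["a", "b", "d"], ["b", "c", "d", "e"]):
        print(side, line)
```

The catch is the same as with 'comm' itself: if the two files were sorted under different locale settings, the walk falls out of step and reports bogus differences, so sort both sides the same way.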
