Steve's picture

Update on the GeoCities rescue from archiveteam.org

If you didn't already know, I've been trying to help Jason Scott at http://archiveteam.org back up Yahoo's 15-year-old free hosting service http://geocities.com before they take it down for good later this summer ( http://help.yahoo.com/l/us/yahoo/geocities/geocities-05.html ). Everything was going well until I got capped by Comcast ( http://badcheese.com/?q=node/94 ) at 250GB of bandwidth in May 2009, so my effort was put on hold. Jason and the rest of the gang at http://archiveteam.org have methodically run through all of the 'neighborhoods' at GeoCities already and downloaded nearly 1TB of content. There is still a lot of information in the users' directories that Jason has not downloaded yet.

Since I was using the http://archive.org crawler (Heritrix) to crawl GeoCities, I was asking a lot of questions in the forums and it turns out that the guys at archive.org were actually paying attention to my problems. So when Comcast cut me off, the guys at http://archive.org decided to help us out and do a "deep crawl" of GeoCities which started in early June. They also said that they'll do "catch up" crawls until the service closes to make sure they get any recent updates up until the day that Yahoo closes their doors.

So, thanks to Jason @ http://archiveteam.org and Gordon Mohr @ http://archive.org, GeoCities and all of its animated GIF goodness will remain on the internet until the end of time. Yay!

NOTE: We tried contacting people at Yahoo and most tape and hard drive storage companies to help us with the project, and nobody even bothered to return our phone calls or emails except for the guys at http://archive.org.

Collage of 5000+ images taken from Amazon's iPhone app via mechanical turk

I put together a little script on my home machine to download all of the images that people submit through Amazon’s iPhone application. The application lets people take a photo of anything they want, then the photo goes to Amazon’s Mechanical Turk service, where someone searches for the product on Amazon’s website and returns a URL for the matching product. The iPhone user can then purchase the product using their phone. Turn-around time varies, but the average in my experience with the application is about 2 minutes, which I think is pretty good.

The 5000+ images were from a time period of less than two weeks and I didn’t collect all of the possible images over the two week period, just a good percentage of them.  Click here for the big image [flickr.com].
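A collage like this mostly comes down to grid arithmetic. The sketch below only illustrates that layout step and is not the original script: the function names and the 32-pixel thumbnail size are made up for the example, and the downloading and image pasting are left out entirely.

```python
import math


def grid_shape(n_images):
    """Return (cols, rows) for a roughly square grid that fits n_images tiles."""
    if n_images <= 0:
        return (0, 0)
    cols = math.ceil(math.sqrt(n_images))
    rows = math.ceil(n_images / cols)
    return (cols, rows)


def tile_origin(index, cols, thumb_w, thumb_h):
    """Return the (x, y) pixel offset of the top-left corner of tile `index`."""
    return ((index % cols) * thumb_w, (index // cols) * thumb_h)


if __name__ == "__main__":
    THUMB = 32  # hypothetical thumbnail edge in pixels
    cols, rows = grid_shape(5000)
    print("grid:", cols, "x", rows)
    print("canvas:", cols * THUMB, "x", rows * THUMB, "pixels")
    print("tile 100 goes at", tile_origin(100, cols, THUMB, THUMB))
```

Filling the grid row by row keeps the images in the order they were collected, which makes it easy to map a click on the collage back to the raw image.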

I also made a web page so people can click on each image and see the raw captured image if they’d like: turk images

Be warned, there are a couple NSFW images in there.

Not-so-surprising stats that I noticed while looking at the images: lots of pictures of dogs, cats, knees, feet, shoes, faces, … It seems people are just bored and take pictures of just about anything to see what the application will return. Popular product images are watches, kitchenware, remote controls and, most predictably, books. Most pictures were taken in someone’s home or while out shopping.

The reason to see UP - Dug the Dog

I got the dreaded phone call from Comcast today

"Sir, we'd like to talk to you about your internet usage if we can."

I knew exactly what the Comcast representative was going to say. I was given a phone number that was not the 1-800-comcast typical sit-on-hold-for-thirty-minutes-and-get-a-peon-who-gives-me-the-runaround line. I got straight to a tech. I knew this was trouble.

"Sir, I'm not sure if you're aware of it, but Comcast has an acceptable use policy that states that no one user can use more than 250 gigabytes of data in one month, and it looks like you used (pause) Oh my (pause) 750 gigabytes during the month of May. If you continue next month to exceed the 250GB limit, you'll be disconnected from Comcast internet for 12 months."

It turns out that this is a new-ish addition to Comcast's Acceptable Use Policy that took effect on October 1st, 2008, and there are no exceptions. I explained that I was trying to crawl geocities.com for http://archiveteam.org and for posterity, that I was trying to be a good netizen ( network citizen, http://en.wikipedia.org/wiki/Netizen ), and that I didn't use all of that bandwidth to download porn or mp3s or movies, but they didn't seem to care. "No exceptions," they said.

I asked them about other account options. Comcast has a business-class line that has a 750GB bandwidth cap, but that's still really not enough to download all of GeoCities in a summer, much less offer it up in any sort of manner for people to grab later. I was even contacted by archive.org (the Internet Archive) asking me if I could help them out with their own archive of GeoCities, which hasn't been updated since 2001, but now I'll have to turn them down since I don't have the bandwidth to grab the site.
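To put those caps in perspective, a monthly cap can be converted into the average rate it allows if the line is kept busy around the clock. This is a rough sketch with a hypothetical helper name, assuming decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a 31-day month; how Comcast actually meters usage may differ.

```python
def average_mbps(gigabytes, days):
    """Average rate (megabits/second) needed to move `gigabytes` in `days`.

    Assumes decimal units: 1 GB = 10**9 bytes, 1 Mbps = 10**6 bits/s.
    """
    bits = gigabytes * 10**9 * 8
    seconds = days * 24 * 60 * 60
    return bits / seconds / 10**6


if __name__ == "__main__":
    # 250 GB is the residential cap; 750 GB is both the May usage
    # and the business-class cap mentioned in the post.
    for label, gb in [("250 GB cap", 250), ("750 GB", 750)]:
        print(f"{label} over 31 days: about {average_mbps(gb, 31):.2f} Mbps sustained")
```

Seen this way, either cap works out to a small fraction of the advertised line speed once a crawler runs all month.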

So, my true feelings about Comcast now? I'm even more upset at them than I was before. Comcast screwed me from day one: over-promising at the time of install, then under-delivering and over-charging when I got my first bill. Taking care of that took about 3 months of emails, calls and regular mail. It's getting harder and harder to get someone to actually HEAR a customer's complaint nowadays. Upper management has so many layers of defense in place to make sure that complaints get taken care of somewhere downstream that they don't seem to care anymore at all. It's like the beach-landing scene at the beginning of the movie "Saving Private Ryan". When I get off the 30-minute boat ride to the shore (on-hold time), I'm shot at with every excuse that the first-level tech support guy has in his arsenal. Then escalating the call to the manager level is like facing the sniper in the pillbox who is mowing down everyone who made it through the first line of defense. I eventually sneak up to the pillbox by sending emails and cold-calling techs and upper-management types that I can find on the internet, take out the pillbox sniper from behind, and get the original pricing and features that were promised to me by the installation/sales guy in the first place.

I even contacted the old "comcast cares" twitter account, but no response from that guy either. Nothing from anyone at all about this bandwidth cap issue.

I'm working on getting a second line now. Qwest DSL is about the same speed/price, but they have a little-documented bandwidth cap also. It's about 400GB, and when you reach it, you just have to hit "accept" on a webpage to continue internet use. No evil threats with disconnection, no extra charges, just a web page.

Another option is a hosted machine somewhere local that I can go to and swap out hard drives. Good bandwidth on a host, plus sneakernet back to my home to crunch the data, is actually acceptable for my current projects, so that's something that I'm working on also.

I'm also looking into the business-class Comcast connection. However, the business-class Acceptable Use Policy states: "Comcast reserves the right to suspend or terminate Service accounts where data consumption is not characteristic of a typical commercial user of the Service as determined by the company in its sole discretion, or where it exceeds published data consumption limitations. Common activities that may cause excessive data consumption in violation of this Policy include, but are not limited to, numerous or continuous bulk transfers of files and other high capacity traffic using (i) file transfer protocol (“FTP”), (ii) peer-to-peer applications, and (iii) newsgroups." So it looks like anything that would require a large amount of data transfer, even on a business-class account, could flag the account for termination also.

Comcast holds all of the cards, or so it would seem.

The reason that Comcast and other Cable/Entertainment providers are putting these bandwidth caps in place is simple. Cable companies make money on 'premium' services. These services are things like Pay-Per-View, HBO, etc ... You are all familiar with the typical bait-and-switch that cable companies provide you with when you sign up. Get tons of stuff for only $20 a month!!! Then in the fine print, you find out that $20 gives you basic cable, no DVR, no movie channels, and the $20/mo is only good for a short time. After the initial period expires, they start to slowly ream you for more and more money until you end up paying out the nose for standard TV and internet. I'm paying approx $150 for Comcast's Internet and HD DVR with basic HD service (no movie channels, no phone service). This is almost a car payment. At least with a car payment, you get the car after a few years and can stop paying! This is $150/mo FOR LIFE with nothing to show for it when you leave! I have a real tough time swallowing that one.

But where was I? Oh yeah. Premium services. The killer problem with the cable companies' revenue model is that downloadable content is right around the corner, if not here already. If people can download all of their favorite movies and TV shows directly off the internet via Hulu, YouTube or any number of emerging video websites, why would anyone pay for cable TV anymore? Cable companies are a thing of the past and will die out some day. Perhaps not soon, but the writing is on the wall, and Comcast and other cable companies know it. That's the real reason for the bandwidth caps. Not to stop piracy, not to keep their neighboring customers happy, but to limit LEGAL multimedia downloading that competes with their high-priced premium services. It's an anti-competition move to save their asses. Remember newspapers and Craigslist? Just think cable companies and Hulu now.

So, what's the lesson learned here? From my home internet connection, I can't cheaply build a startup that competes with well-funded companies, because of bandwidth caps and low upload speeds: I can't crawl the web, I can't store large chunks of data, and I can't host a high-volume website. So I need either a business-class connection (which still has a cap, is more expensive, and is still not unlimited use) or, better, a hosted machine in a datacenter (actually the best solution for my needs) from a company that doesn't have a competing interest in the entertainment industry. I need to get away from Comcast. Comcast has some really upset and pissed-off users out there, and I don't see it getting any better any time soon. The more popular online video sites become, the more Comcast is going to clamp down on internet usage. For my projects, I can't have that. I'm going a different direction. Sorry Comcast. You lose in the end.

Why making a startup out of your basement is so difficult

I like building startups.
Sure, I could make a blog about my dog's favorite chew toy.
Sure, I could make another url-shortening service.
Sure, I could make a mashup of something that involves Craigslist, eBay, Google Maps, Hulu and Flickr.

However, that crap is boring and I prefer to think bigger. I always like to dream about putting something together that will really make a difference. I would love to take on the big guys and beat them at their own game. However, there are more and more stumbling blocks to do this as I try more and more different things out of my basement with my shoestring budget:

  • ISP bandwidth caps - I can get 7Mbps downstream from my Comcast ISP, but I can't crawl any large sites due to Comcast's 250GB/mo bandwidth cap, and don't even get me started on the upstream bandwidth ...
  • Storage/indexing - I can buy a few 1TB hard drives every once in a while and stay within my startup shoestring budget, but getting MySQL or Lucene to index terabytes of content is not a simple problem. I've got a basement with 8 old-ish computers of varying capacities that are pretty busy on a regular basis with miscellaneous things to do. Keeping on top of a mountain-sized chunk of data is not an easy task.
  • Google is huge. I can't compete with Google at crawling/indexing/search. Nobody really can. However, I could choose a subset of Google's empire, focus on it, build a better mousetrap and win. Just choosing an interesting niche to hack on is a difficult task in itself. To me, Google's AdSense/AdWords system seems the most profitable and interesting target that I'd love to build a competitor to. Troy and I built http://bidboxr.com in an effort to experiment with the online ad space and we learned a lot in the process, but we didn't go the extra distance and take on the big-G at their own game. One of our other ideas, http://mediawombat.com, a Flash search engine, proved to be a valuable experiment, but again, my shoestring budget kept me from really hitting this one home.
  • People/Time - I'm a husband and father of two wonderful children. I find it difficult to go to the bathroom without being interrupted by someone or something nowadays. Finding spare time is becoming a difficult thing to do in itself. Most of my development is done in the wee hours of the night when I'm low on energy, but finally have some free time to myself. I tend to work out the details in my head over a series of days or weeks. I take notes about possible solutions and new directions, then think about those. When I think that my brain has slept on the problem enough nights, I can usually whip out a solution in code form in an hour or two. This is my current way of building things. The actual work takes a lot longer to get started than it did when I was single or in college and could just pound on the keyboard for 24 hours in a row until it worked. I'm not sure if this new way is better or worse, but it fits in with my lifestyle a little bit more.
  • I'm trying to learn about SEO. It seems to me that SEO is an always-changing and challenging market that is profitable and untapped on several fronts. I've got a few experiments going with an SEO twist, mostly so I can learn about the whole SEO world, but SEO takes a long time: search engines are slow to index your content, so changes only show up after a long wait. My SEO experiment is http://xis.cc and is doing well with Yahoo, but Google doesn't like it very much.
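To make the bandwidth-cap bullet concrete, here is a back-of-the-envelope sketch of how quickly a crawler running at the full 7Mbps downstream rate would use up a monthly cap. The helper name is hypothetical and the units are assumed to be decimal (1 GB = 10^9 bytes); the cap is a parameter, so plug in whichever figure applies.

```python
def days_until_cap(cap_gb, line_rate_mbps):
    """Days of continuous downloading at `line_rate_mbps` before `cap_gb` is used up.

    Assumes decimal units: 1 GB = 10**9 bytes, 1 Mbps = 10**6 bits/s.
    """
    bits = cap_gb * 10**9 * 8
    seconds = bits / (line_rate_mbps * 10**6)
    return seconds / (24 * 60 * 60)


if __name__ == "__main__":
    # Monthly caps in the 200 to 250 GB range, at the 7 Mbps downstream rate.
    for cap_gb in (200, 250):
        print(f"{cap_gb} GB at 7 Mbps: cap reached after about "
              f"{days_until_cap(cap_gb, 7):.1f} days")
```

In other words, a crawler that saturates the line burns through the whole month's allowance within the first few days.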

Anyway, that's all I have for now. More later. :)

Table Mesa & Flatirons Summer and Snow




Table Mesa & Flatirons

Originally uploaded by notanyron

This is a great shot and pseudo-typical Boulder weather. A cross between Winter and Summer in the same shot. :)

Apache tuning for small-ish linux machines

I started with a dedicated web server running on a single-core Linux machine with 256MB of memory. It's the machine that's hosting this website right now. I've had some very good experiences with this machine and some not-so-good ones. I've upgraded the memory to 512MB, but I still find myself stretching for resources. Apache also seemed to crash on occasion, and I kept fighting with it to get good response times while still tuning for low memory. I found several issues that I'd like to mention in case others are having similar problems.

MySQL takes 200MB for its key buffer and the OS takes approx 100MB, so that leaves about 200MB for Apache/PHP. The solution is to not keep servers hanging around processing unlimited keepalive requests. The setting "MaxRequestsPerChild" forces Apache to respawn children after a certain number of requests have been processed. This keeps Apache and PHP from dying: if PHP has a bug and hangs, then PHP will be broken, but Apache will continue to serve static content until it reaches the limit of this value, then it'll respawn and PHP will be all better again.

This is not optimal for a mega-busy web server, but it's a good 'stable' configuration for a medium-busy web server like mine. I host 40+ medium-to-small websites on this single machine using these settings.

Operating system: Fedora FC6, Apache: 2.2.6, PHP: 5.1.6

Apache settings:

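# Seconds to wait on a stalled connection before giving up
Timeout 5

# Allow persistent connections, but keep them short-lived
KeepAlive On
# Maximum number of requests served over one persistent connection
MaxKeepAliveRequests 200
# Seconds to wait for the next request on a persistent connection
KeepAliveTimeout 3

<IfModule prefork.c>
# Start small and keep only a few idle children around
StartServers       1
MinSpareServers    1
MaxSpareServers    5
# Upper limit on simultaneous child processes
MaxClients         25
# Recycle each child after it has handled this many requests
MaxRequestsPerChild  500
</IfModule>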

This may not seem all that important, but it took a while to hone these settings so that Apache works just right under low-memory conditions and possibly buggy PHP, and still continues to live on a busy machine and serve up a bunch of content without any noticeable issues. If you've got a machine with a similar setup to mine, try out this Apache config and let me know how it goes.
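The MaxClients value can be sanity-checked against the memory budget described above. The sketch below is only a rough illustration: the function name is hypothetical, and the 8MB per-child footprint is an assumed figure, not a measurement from this server, so measure your own Apache/PHP children before trusting the result.

```python
def max_clients(total_mb, mysql_mb, os_mb, per_child_mb):
    """Largest number of Apache children that fit in the leftover memory."""
    available = total_mb - mysql_mb - os_mb
    if available <= 0 or per_child_mb <= 0:
        return 0
    return available // per_child_mb


if __name__ == "__main__":
    # 512MB machine, 200MB MySQL key buffer, approx 100MB for the OS.
    # ASSUMPTION: each Apache/PHP child uses about 8MB of resident memory.
    print("MaxClients budget:", max_clients(512, 200, 100, 8))
```

With those inputs the budget lands in the same ballpark as the MaxClients 25 used in the config. A heavier PHP application with larger children would call for a lower limit.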

Got rejected by TechStars again, but we're getting better at it

Last year we applied to TechStars 2008 for our website http://mediawombat.com (a search engine that indexes the contents of Flash media (*.swf files)) and were rejected.  That was our first rejection and it stung a little bit.  Much like Micah, we were already dreaming about the fast servers and huge pipes that we could afford with the seed money and were looking forward to doing something huge.

This year, we came up with another concept and put it together.  The new idea is http://bidboxr.com (a combination of eBay and AdSense – where you would put your auctions on other peoples’ websites as ads).  We got rejected by DreamIT Ventures before the deadline for TechStars (they told us that competing with eBay was crazy).  I attended TS4AD in March and used all of the tips and tricks that I learned there to pitch our site and our team, and to keep TechStars in the loop (like they requested) while we worked on our site some more.  We were sent an email on March 30th saying that we had made the top-50 companies and that TechStars would notify us in two weeks whether we had made the cut.  We were a little happier, but tried to keep the pessimistic attitude.  The chances had gone from 1/200 to 1/3, so we were feeling lucky, but it wasn’t a done deal yet.

This morning (April 13) we got our TechStars 2009 rejection letter, but it didn’t sting as much as the earlier ones did.  I was staying pessimistic about the whole thing just in case (there were 500+ applicants to TechStars this year after all).  We already have small investors expressing interest in our site, and the site is up and running with hardly any costs, so we didn’t really need TechStars all that much to begin with.

So, I’m making an effort to not sound like one of those American Idol rejects shown on the first few episodes of every season who say things like, “@#$% you Simon!  I’m going to make it big without you and your stupid show”.  For those of us who didn’t get into YC or TS this year (Micah), don’t worry.  I think that I’ve figured out how to make it (not big, but get some traction at least) without the use of seed investors.

I’m a technical guy.  I write code, get websites up and running and deal with the computer/technology piece of the puzzle.  I don’t know anything about pounding the pavement and cold-calling people to see if they’d be interested in our technology.  The thing that I have going for me is that my partner is also a technical guy, but has a background in sales!  He understands the technology (he created a lot of it) and also does a fantastic job of getting people to come and look at our site and sit down with us to discuss possible business opportunities.

TechStars suggests that 2/3 to 3/4 of your team should be technical and I agree.  Having a team that can whip out tons of code out in a very short period of time is essential to making major changes to your site in the 3 months of TechStars and keeping your initial customers happy.  However, if you don’t make it into a seed program, my suggestion is that you add a little more heft into your sales force to do some of the footwork that the seed programs may have helped you out with and also change your thinking to more like a penniless startup (read on).

When I attended TS4AD in March, I was really impressed by one speaker in particular.  He was the CEO(?) of http://dailycandy.com  and talked about starting a company in the post-9/11 NYC economy where there was no money.  His first server was a computer sitting under someone’s desk and the whole website operated on a business-class DSL line that they paid < $100/month for.  They didn’t go the VC route.  His rule of thumb was, “Don’t spend a dollar until you have two”.  I wrote that quote down.  They focused on one thing at a time and did email marketing – it was free and the software was hand-written, so costs were essentially nothing.  The company slowly grew over time and so did the product, but by keeping the overhead low (no copy machine, every employee assembled their own desk and chair, etc.), the company was able to maintain the startup culture as long as possible in tough economic times and survive.  His talk was great and left me feeling like there was a second option available to me as a founder of a startup that I hadn’t even thought of before.  For a short while, we even considered pulling our application out of TechStars to go the way of the cheap-o startup, but we didn’t pull it – just in case.

So, in short, we (and you guys too) really don’t need a seed program (Take that, Simon!).  We had made the top-50 at TS and missed getting selected by a very slim margin.  This gives us some good confidence and tells us that our idea has some real merit.  We’re currently very fueled by this notion and using that energy to move forward with our expansion plans on our own.  Just this morning, we were contacted by a company who is willing to list over 5000 products on our new site and it’s making us feel even better about our situation!

For those of you who were rejected by TechStars, DreamIT, Y-Combinator, …  We feel your pain, and we all need to keep our collective chins up and noses to the grindstone.  Getting rejected is not the end of your startup, it’s only the beginning.  Get out there and ring some doorbells.  Take a small business owner out to lunch and enjoy being an entrepreneur!

Oh, and next year – we’ll probably do it all over again.  :)
