Author Topic: Archive.org  (Read 3758 times)

Jetfire99

  • Lieutenant
  • ***
  • Posts: 92
Archive.org
« on: October 15, 2012, 09:30:18 PM »
A friend who doesn't play COH heard of our plight, and one thing led to another, resulting in this.

First off, I apologize that this isn't a proper advertisement, since it isn't for another mush. However, this concerns an MMO community that I feel deserves a little spotlighting.

I emailed Jason Scott of Archive.org to see if he had any words of wisdom about NCSoft's decision to shut down City of Heroes, since Mr. Scott and the Archive Team are responsible for backing up most of Geocities, Friendster, LuLu Poetry, and Yahoo Video, and in general are awesome people for not letting things die and occasionally being able to shame companies (even Google) into taking a step back from the slash-and-burn approach.

His Response:
OK, so first - we downloaded the forums. 218 gigabytes compressed! Crazy!
Then we downloaded the main website. 3.6gb. Not so big, but still, something.

http://archive.org/details/archiveteam-city-of-heroes-www
http://archive.org/details/archiveteam-city-of-heroes-main

I made sure that as many pissed off CoH users know about it, and maybe


He asked me to keep his name out, but hey, we've got more friends in oddball places. They also plan to archive again near the end. I know we have our own efforts, but this means we have another backup of everything. :)

Victoria Victrix

  • Team Wildcard
  • Elite Boss
  • *****
  • Posts: 1,886
  • If you don't try, you have failed.
    • Mercedes Lackey
Re: Archive.org
« Reply #1 on: October 15, 2012, 11:54:37 PM »
I've already written to thank him.
I will go down with this ship.  I won't put my hands up in surrender.  There will be no white flag above my door.  I'm in love, and always will be.  Dido

Jetfire99

  • Lieutenant
  • ***
  • Posts: 92
Re: Archive.org
« Reply #2 on: October 16, 2012, 03:26:52 AM »
Awesome to hear that.

lobster

  • Underling
  • *
  • Posts: 6
Re: Archive.org
« Reply #3 on: October 16, 2012, 04:43:43 AM »
Awesome.  Although celebrating that makes me even more bummed as it shifts other things towards "acceptance".

NecrotechMaster

  • Elite Boss
  • *****
  • Posts: 388
  • is there a badge for that?
Re: Archive.org
« Reply #4 on: October 16, 2012, 05:21:07 AM »
this will be a good thing to help rebuild the forums once the official forums are shut off on nov 30th

Lycantropus

  • Elite Boss
  • *****
  • Posts: 255
Re: Archive.org
« Reply #5 on: October 16, 2012, 07:29:28 AM »
I have no skills of measurable worth in any of the endeavours thus far. Thank you for finding yet another way of preserving our City of Heroes! The forums were an invaluable resource for touching on the pulse of what we wanted, and much of it was stuff we could say with little fear that few other MMOs would even begin to ask. This is important. From the direct game experience, to the interests of its players, the CoH forums were a rich resource.

For me, the forums were one of the few regular spots I stopped to check on new things, and there were many names I knew even though I lurked. Being able to look back at least, is a worthwhile pursuit. Especially before they tried to stop talk of other games in the same genre. I could see why they did it, but they had nothing to fear from any competitor, only something to learn (when they did something better- which was less common than expected). Makes me wonder where the order for that came from... but that's another story.

Lobster, it's one thing to accept that one thing is coming to a close, but it's just as important to do it without forgetting the lessons it's taught us, and what we carry away with it (even in terms of a sense of history). A record of what has passed... somewhere... is of value. We gain nothing by losing something that was worthwhile. Future MMO designers could learn a thing or two from what's been posted in our forums (heck even those that stayed or went to Cryptic could learn a thing or two about communication and teaming mechanics from what City of * can produce to this day and they were people that seemed to miss some fundamental things from doing that!!!)

I've had a growing concern in today's disposable culture that more and more things will be forgotten as the 'next new thing' rears its head. I at least find comfort in knowing that something is being done to preserve what came before. It's some comfort to see that some group is trying to keep records of what has passed (or soon will).

Thanks for the link to the site, at any rate. Glad to hear they're actually keeping the records. I'll be keeping a link to it.

Lest we forget.

Graphite

  • Lieutenant
  • ***
  • Posts: 89
Re: Archive.org
« Reply #6 on: October 16, 2012, 04:32:24 PM »
Awesome

/em thumbup

Mantic

  • Boss
  • ****
  • Posts: 172
Re: Archive.org
« Reply #7 on: October 17, 2012, 12:56:58 AM »
Great that the spider was sicced on it. It still might help to trawl the more important pages with Archive.org's liveweb, since the spider doesn't always seem to be 100%.

Archive.org also has one big problem: if anyone ever takes control of the domain and puts up a robots.txt, Archive.org by default respects that document above all past archival records. I don't know if past records get wiped or just blocked from public access, but the net result is the same. Cybersquatters and other meanies have caused the loss of many of the oldest sites Archive.org once preserved that way.

Little David

  • Boss
  • ****
  • Posts: 149
    • The Ad Ultimum Network
Re: Archive.org
« Reply #8 on: November 29, 2012, 08:07:06 PM »
The Archive Team is different from the Wayback Machine's spider. They don't use the same tools to do the job; they actively run scripts that go through and download specific stuff. They use a program of their own design, the ArchiveTeam Warrior, as well as tools like wget.

Also, since this is currently being offered as a downloadable archive on Archive.org, it's separate from the Wayback Machine anyhow; if/when a squatter takes over City of Heroes' domain name and puts up that robots.txt, the archive of the CoH forums will still be available as a download.

I do recall that the Archive Team wants to make their archives integrated into the Wayback Machine, as the actual spider that Archive.org uses is very incomplete. Jason Scott once likened it to somebody saving photos of a house before it was burned down, rather than the actual stuff inside.

Malohin

  • Underling
  • *
  • Posts: 17
  • Excellent!
Re: Archive.org
« Reply #9 on: November 30, 2012, 02:13:23 AM »
You can do this on your own, too. I've been trawling the forums by individual posts. I can't quite get this to use my login credentials, but the public stuff is -- public. :)

Code: [Select]
mkdir -p log
/usr/bin/seq 5 | parallel --joblog "$0.log" --resume wget -e robots=off --user-agent=lugnutz --page-requisites '--append-output=log/single_post-{}' --html-extension 'boards.cityofheroes.com/showpost.php\?p={}'

Code: [Select]
/usr/bin/seq 5      ### Please count from one to five
|                   ### Send that list to
parallel            ### the parallel program, to run a bunch of programs in parallel
--joblog "$0.log"   ### (parallel option, so it goes before wget) list of completed jobs; in a script, $0 is the script's name
--resume            ### (parallel option) use the joblog so we don't repeat tasks
wget                ### the program being run: the wget web crawler
-e robots=off       ### Ignore polite requests to stay out of things
--user-agent=lugnutz ### lie about who we are, in case they are blocking the wget program
--page-requisites   ### Get everything the page needs
'--append-output=log/single_post-{}'
                    ### make a copy of what happened here ( the {} is where the number goes single_post-1, single_post-2, etc. )
--html-extension    ### Force the page to actually have an HTML extension, so we can open it directly in a browser
'boards.cityofheroes.com/showpost.php\?p={}'
                    ### This is the page we want -- {} is the number again; the \ keeps the shell that parallel starts from treating ? as a wildcard

If someone with way too many cores and a big enough pipe wants to run this, they could...

Note this doesn't preserve threads, just the individual articles, and only the public ones. If someone has some clue why my cookies file isn't working, that would be grand but that only means I'm not getting the beta and other 'private' forums I have access to.
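
For anyone without GNU parallel installed, roughly the same fetch can be sketched with xargs -P instead. This is a swapped-in technique and not the command above: the function name fetch_posts and the JOBS and WGET variables are made up for this example, and because xargs runs wget directly (no second shell), the URL needs no backslash.

```shell
#!/bin/sh
# Sketch only: fetch a range of public posts using xargs -P in place of GNU parallel.
# JOBS and WGET are hypothetical knobs introduced for this example.
JOBS="${JOBS:-4}"      # how many downloads to run at once
WGET="${WGET:-wget}"   # set WGET=echo to print the arguments instead of downloading

fetch_posts() {
    first="$1"
    last="$2"
    mkdir -p log       # the per-post log files below need this directory
    seq "$first" "$last" |
        xargs -P "$JOBS" -I{} "$WGET" -e robots=off --user-agent=lugnutz \
            --page-requisites --html-extension \
            --append-output=log/single_post-{} \
            'http://boards.cityofheroes.com/showpost.php?p={}'
}
```

Calling fetch_posts 1 5 would cover the same five posts as the seq 5 example. Unlike parallel's --joblog and --resume, this sketch keeps no record of finished jobs, so a rerun repeats everything.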
--
Malohin

Edit: Added --joblog --resume
« Last Edit: November 30, 2012, 09:33:11 PM by Malohin »

Flashtoo

  • Minion
  • **
  • Posts: 46
Re: Archive.org
« Reply #10 on: November 30, 2012, 04:09:19 PM »
This includes the comic book pages?

Malohin

  • Underling
  • *
  • Posts: 17
  • Excellent!
Re: Archive.org
« Reply #11 on: November 30, 2012, 09:29:55 PM »
Not quite sure what the question is, Plangkye. This will grab a specific, single page if that page can be viewed by a non-logged in visitor.

If you can find the page you want, and get the post number, you can test this yourself on a machine with the wget program installed. If the post number is 123456 this is what the command would look like:

Code: [Select]
mkdir -p log
wget -e robots=off --user-agent=lugnutz --page-requisites '--append-output=log/single_post-123456' --html-extension 'boards.cityofheroes.com/showpost.php?p=123456'
Post Number
If you go to this thread, you'll see there is one post. Over on the right, there is a little '#1' which is a link to just that post. In this case, the URL for that single post is:

Code: [Select]
http://boards.cityofheroes.com/showpost.php?p=4428406&postcount=1
You can see the post number is 4428406. If you wanted to grab just this post, you could use:

Code: [Select]
wget --page-requisites --html-extension 'boards.cityofheroes.com/showpost.php?p=4428406'
In general, it looks like the guys from Archive.org have this covered (see the first post in this thread) but I'm paranoid and wanted to copy off some pages for my own use. :)
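
If you would rather paste the whole single-post link than pick the number out by eye, a small helper along these lines might do it. It is only a sketch: post_id_from_url and fetch_post_by_url are names invented for this example, and it assumes the link carries the post number in a p= parameter like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: pull the post number out of a showpost.php link.
post_id_from_url() {
    id="${1#*[?&]p=}"   # drop everything up to and including "?p=" or "&p="
    id="${id%%&*}"      # drop any later parameters such as "&postcount=1"
    case "$id" in
        ''|*[!0-9]*) return 1 ;;   # nothing that looks like a post number
    esac
    printf '%s\n' "$id"
}

# Hypothetical wrapper: fetch one post given its full link.
fetch_post_by_url() {
    id="$(post_id_from_url "$1")" || return 1
    wget --page-requisites --html-extension \
        "http://boards.cityofheroes.com/showpost.php?p=${id}"
}
```

With the link from the example, post_id_from_url prints 4428406, and fetch_post_by_url would then run the same wget command shown above for that post.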

The Fifth Horseman

  • Elite Boss
  • *****
  • Posts: 961
  • Outside known realities.
Re: Archive.org
« Reply #12 on: November 30, 2012, 10:37:37 PM »
Cross-posting this here.
If you have something capable of batch download, use this URL pattern:

http://boards.cityofheroes.com/printthread.php?t=<thread_num>&pp=500&page=<page_num>
This will grab threads from the archive - lighter on layout, more posts displayed per page. Thread numbers go up to around 298900 at this time, and you'll need to download one page for every 500 posts in the thread, rounding up.
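
As a rough sketch of the "one page for every 500 posts, rounding up" rule, the URLs for a thread could be generated like this. The function name printthread_urls is invented for the example, and you have to supply the thread number and its post count yourself.

```shell
#!/bin/sh
# Hypothetical helper: print every printthread.php URL needed for one thread.
# $1 = thread number, $2 = number of posts in that thread.
printthread_urls() {
    thread="$1"
    posts="$2"
    per_page=500
    pages=$(( (posts + per_page - 1) / per_page ))   # integer division, rounded up
    page=1
    while [ "$page" -le "$pages" ]; do
        printf 'http://boards.cityofheroes.com/printthread.php?t=%s&pp=%s&page=%s\n' \
            "$thread" "$per_page" "$page"
        page=$((page + 1))
    done
}
```

For a made-up thread 12345 with 1201 posts this prints three URLs (pages 1 to 3). The list can be piped into wget --input-file=- or saved to a file for whatever batch downloader you use.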
We were heroes. We were villains. At the end of the world we all fought as one. It's what we did that defines us.
The end occurred pretty much as we predicted: all servers redlining until midnight... and then no servers to go around.

Somewhere beyond time and space, if you look hard you might find a flash of silver trailing crimson: a lone lost Spartan on his way home.