Titan Network

Archive => Save Paragon City! => Topic started by: Chaos Ex Machina on September 22, 2012, 03:59:10 AM

Title: Can we start a board mirroring project?
Post by: Chaos Ex Machina on September 22, 2012, 03:59:10 AM
With the Zwillinger news I expect the official boards may go rather quickly.  Please act so the board is not GONE TO THE AMERICANS.

I am ridiculously busy and involved in another project I intend to announce soon, but board parsing should not be a particularly tech intensive task if any of you are willing.
Title: Re: Can we start a board mirroring project?
Post by: uninventive on September 22, 2012, 04:02:01 AM
Yeah, I'm thinking along the same lines, too: the fewer red names present, the earlier they'll just axe the boards to avoid liability (player makes racist/hate speech comments, they linger, lawyers pirouette, peasants get lulz, and NCSoft closes them anyway).
Title: Re: Can we start a board mirroring project?
Post by: SithRose on September 22, 2012, 05:21:55 AM
I believe several people are already working on this - the thread may have dropped to the third page or so, but it's here.
Title: Re: Can we start a board mirroring project?
Post by: Chaos Ex Machina on September 22, 2012, 10:13:59 PM
I saw a lot of discussion but no commitments.
Title: Re: Can we start a board mirroring project?
Post by: WanderingAries on September 23, 2012, 05:37:53 AM
I saw links, discussion, and concepts, but no solid confirmations.
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on September 23, 2012, 06:02:07 PM
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.

I have three archives:

The first archive is everything except the forums. So that would include pretty much anything at *.cityofheroes.com that can be found through recursively linking from the main website, as well as content on a few other related domains. So this includes domains such as ftp.coh.com, goingrogue.na.cityofheroes.com, ftp.ncsoft.com. I last took a snapshot around September 10 but I may take a new snapshot again later this week. This archive is more than 20 GB in size, due to media files.

The second archive is just the forums. My initial attempts to archive this met with some difficulties. Because it's a dynamic site with many ways of viewing the same information, you end up with many, many copies of the same information. And because it's all "flat", it was making my computer cry to have so many files in a single directory. However, this thread got me investigating again and I discovered that there's an archive-friendly (http://boards.cityofheroes.com/archive/index.php) version of the forums that strips out links, images, formatting, etc. Now that is great for archiving! So last night I kicked off an archive job and it downloaded successfully, though it stopped at, I think, 1,000,000 links (because I didn't realize that limit was a default). I now have it set to download with a much higher threshold. I don't know how long it will take to finish, but I'll try to make sure I get at least one full snapshot. Not sure how big it will end up being, but the partial download was 1 GB.

The third archive is all the videos from their Twitch.TV, Ustream, and YouTube accounts. This is nearly 60 GB of data.

I'm also keeping the archives I make backed up on SpiderOak (https://spideroak.com/) so I'm reasonably confident I won't lose them due to drive failure or anything. However, as I said, I wouldn't want to discourage anyone from making their own archives; more copies is safer. I'm using HTTrack (http://www.httrack.com/) to make my website archives.
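
The link cap described above is easy to miss, because a capped crawl still finishes "successfully" with a partial mirror. A minimal Python sketch of that behaviour, using an in-memory link graph instead of real HTTP requests. The names `crawl`, `max_links`, and `SITE` are hypothetical, and this is not how HTTrack is implemented internally:

```python
from collections import deque

def crawl(start, links, max_links):
    """Breadth-first walk over `links` (page -> list of linked pages).

    Returns (visited_pages, truncated). `truncated` is True when the cap
    was reached while pages were still waiting in the queue.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        if len(visited) >= max_links:
            return visited, True
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, False

# Hypothetical miniature archive: one index, two forums, three threads.
SITE = {
    "index": ["f-1", "f-2"],
    "f-1": ["t-1", "t-2"],
    "f-2": ["t-3"],
}

partial, cut_short = crawl("index", SITE, max_links=4)
full, cut_short_full = crawl("index", SITE, max_links=1000)
```

Reporting the `truncated` flag alongside the result is the point of the sketch: a mirroring job that hits its cap should say so rather than look complete.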
Title: Re: Can we start a board mirroring project?
Post by: Chaos Ex Machina on September 23, 2012, 06:23:30 PM
There may be a way to archive vBulletin boards specifically, in a form that could be converted to a database; however, I found none.

You could filter URLs to ignore certain types of categories.  For example only capture the indexes, postings, and profiles.
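
A rough Python sketch of such a filter, for anyone who wants to try it. Only `forumdisplay.php` is confirmed by links in this thread; the other script names are the usual vBulletin ones and should be treated as assumptions, as should the helper name `should_capture`:

```python
from urllib.parse import urlparse

# Assumed vBulletin script names; check them against the live board.
WANTED_SCRIPTS = {
    "index.php",         # board index
    "forumdisplay.php",  # forum index pages
    "showthread.php",    # threads (postings)
    "member.php",        # profiles
}

def should_capture(url, host="boards.cityofheroes.com"):
    """Return True if the URL is on the board host and targets a wanted script."""
    parts = urlparse(url)
    if parts.hostname != host:
        return False
    # The last path segment is the script name; a bare "/" means the index.
    script = parts.path.rsplit("/", 1)[-1] or "index.php"
    return script in WANTED_SCRIPTS
```

Everything else on the host (search results, "new posts" views, and so on) is rejected, which is where most of the duplicate copies tend to come from.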
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on September 23, 2012, 08:04:29 PM
Quote from: Chaos Ex Machina on September 23, 2012, 06:23:30 PM
You could filter URLs to ignore certain types of categories.  For example only capture the indexes, postings, and profiles.

That was actually what I was starting to do yesterday, until I noticed the archive version of the site. The archive version gives us all the post contents and forum indexes, so I figure that's the highest priority to capture, at least initially. Once I have a solid archive of that, I may try to expand to the "normal" version of the forums to capture images and formatting, though I have a suspicion that'll blow the size of the archive up quite significantly.
Title: Re: Can we start a board mirroring project?
Post by: Windy on September 24, 2012, 12:52:19 AM
Quote from: Sekoia on September 23, 2012, 06:02:07 PM
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.

Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)

Archiving a forum without server access is not something we know how to do in my household.  If someone wants to give me a few instructions, I'm happy to be another archiver.
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on September 24, 2012, 01:16:45 AM
Quote from: Windy on September 24, 2012, 12:52:19 AM
Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)

I don't have any specific plans for how to make them available just yet, but when the official sites go offline I'll work with Titan to see if there's some way we can make them publicly available. I'd definitely like to see that happen.

Quote from: Windy on September 24, 2012, 12:52:19 AM
Archiving a forum without server access is not something we know how to do in my household.  If someone wants to give me a few instructions, I'm happy to be another archiver.

Here's the software I used for it: http://www.httrack.com/ I'm currently running an archival job, so I can't check which settings I used right now, but I'll try to share them later. The software isn't too terribly hard to learn, especially if you read through their documentation (http://www.httrack.com/html/index.html).


UPDATE: I managed to finish downloading the archive version of the forums. It totals about 4.5 GB. Note that this also only includes the publicly viewable parts of the forums, not the stuff that requires log-in.
Title: Re: Can we start a board mirroring project?
Post by: Quinch on October 01, 2012, 07:22:12 AM
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.

I'm thinking of putting something together that will simply go page by page and download thread by thread. I know I can do it, but as I'm still getting the hang of C#, I'm not sure if I can do it quickly enough.
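
The link explosion can be tamed by reducing every thread link to a (thread, page) pair before queueing it. A minimal sketch of that idea in Python rather than C#. It assumes thread links look like `showthread.php?t=<id>&page=<n>`, which is an assumption about the board's URL scheme; `thread_page_key` and `unique_pages` are hypothetical helper names. Links that identify a thread only by post id are skipped here and would need a separate lookup:

```python
from urllib.parse import urlparse, parse_qs

def thread_page_key(url):
    """Reduce a thread URL to (thread_id, page), or None if it is not one.

    The fragment (e.g. #post123) and any other query parameters are ignored.
    """
    parts = urlparse(url)
    if not parts.path.endswith("showthread.php"):
        return None
    query = parse_qs(parts.query)
    thread_ids = query.get("t")
    if not thread_ids or not thread_ids[0].isdigit():
        return None
    page_text = query.get("page", ["1"])[0]
    page = int(page_text) if page_text.isdigit() else 1
    return (int(thread_ids[0]), page)

def unique_pages(urls):
    """Keep the first URL seen for each distinct (thread, page) pair."""
    seen = set()
    kept = []
    for url in urls:
        key = thread_page_key(url)
        if key is not None and key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

With this in place, a link to a post in a thread that was already fetched maps to an existing key and is never downloaded again.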
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 01:39:29 PM
Quote from: Quinch on October 01, 2012, 07:22:12 AM
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.

Yeah, I gave up my first attempts at mirroring the forums because of exactly that. Perhaps with filters it could be made more reasonable, though.

The archive version of the forums only represents each post exactly once, so the issue you cite fortunately wasn't a problem there. Unfortunately, the archive version is far from perfect. It strips off formatting and drops images.

Quote from: cmgangrel on October 01, 2012, 11:43:42 AM
Which means that the Beta testing boards are not gathered (ie the ones you see if you log in on a VIP account)

This is correct.
Title: Re: Can we start a board mirroring project?
Post by: voodoogirl on October 01, 2012, 01:53:59 PM
What about Closed Beta forums? I still have "subscriptions" to the one or two I was in.
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 01:58:59 PM
My current archive only contains the publicly-viewable parts of the forums, so that excludes VIP and Beta stuff.

I'm now making an attempt to archive the VIP section (I found out how to have WinHTTrack log in), but unfortunately I don't have access to any Beta forums that might exist so I wouldn't be able to archive them. I'll post again when the mirroring job is complete to note whether it successfully captured the VIP stuff.
Title: Re: Can we start a board mirroring project?
Post by: voodoogirl on October 01, 2012, 02:02:20 PM
I can tell you that http://boards.cityofheroes.com/forumdisplay.php?f=726 leads to the main I-20 Pre-Beta forums, if you can download it.
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 02:08:00 PM
Nope, tells me I don't have permission to access it (even though I'm logged in). :(
Title: Re: Can we start a board mirroring project?
Post by: voodoogirl on October 01, 2012, 02:10:21 PM
Maybe you can ask a former Dev who still has forum log in privileges...?
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 02:14:22 PM
To clarify, now that I'm paying closer attention: the VIP forums I have access to are actually for the Issue 24 beta (for some reason the fact that they were both VIP and Beta wasn't clicking for me...):
Issue 24: Resurgence [VIP Beta] Forums
- Issue 24: Resurgence [VIP Beta] Announcements Forum
- Issue 24: Resurgence [VIP Beta] General Discussion Forum
- Issue 24: Resurgence [VIP Beta] Feedback Forum
- Issue 24: Resurgence [VIP Beta] Bug Reports Forum

Are the beta forums for previous issues (such as I-20) actually still there? I was under the impression they nuked them after a while.

I honestly don't follow the official forums very much, so sorry if that sounds clueless. :)
Title: Re: Can we start a board mirroring project?
Post by: voodoogirl on October 01, 2012, 02:19:22 PM
Nope, they are there. I can still see all the contents of http://boards.cityofheroes.com/forumdisplay.php?f=734

Maybe parent directories are hidden but not children?

A few months ago I discovered the ascending order of new forums and found an empty forum and posted in it, wreaking some havoc for one night. The thread was excised the next morning and the phantom forum locked.

AFAIK they don't erase the forums - they make them invisible - since they sometimes refer back to the threads.
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 02:50:50 PM
Interesting! Once I confirm that HTTrack is actually working properly for the log-in sections, I'll send someone a PM and inquire about the beta forums then. Doesn't hurt to ask. :)
Title: Re: Can we start a board mirroring project?
Post by: Dr Toerag on October 01, 2012, 05:14:15 PM
http://wayback.archive.org/web/*/http://boards.cityofheroes.com

I think that this site has already archived the forums a few times over the years (going back to 2004, with the latest being quite recent).
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 01, 2012, 05:23:06 PM
The problem with the Wayback Machine is that you won't know if they captured content for a very long time. Their most recent capture of the official forums is July 2011, which is over a year ago. Plus a lot of their copies are only partial -- I keep landing on their "liveweb" interface when browsing their archive. So they're good for what they already have, but I don't want to depend on them for being complete and up-to-date.
Title: Re: Can we start a board mirroring project?
Post by: Dr Toerag on October 01, 2012, 05:28:06 PM
I KNEW it was too good to be true :(.
Title: Re: Can we start a board mirroring project?
Post by: voodoogirl on October 01, 2012, 05:32:58 PM
Maybe we can ask Zwill to unsecret a lot of the old CB boards?
Title: Re: Can we start a board mirroring project?
Post by: dwturducken on October 01, 2012, 10:27:11 PM
I was part of Issue 18 closed beta, as well as one of the ones prior, but I can't remember if it was immediately prior or something like Issue 16 or even 15.  If anyone has a link to the Issue 18 closed beta to try, I'd be happy to get that one.  Maybe we farm out the closed beta forums? Put out a Call?
Title: Re: Can we start a board mirroring project?
Post by: Sekoia on October 02, 2012, 01:35:31 PM
Just an update -- my attempt to mirror the I24 VIP Beta forums failed. I guess the cookies didn't work right. Of course, the official forums have a tendency to act funny with their cookies anyway, so it's not a big surprise to me. I'll try making another attempt at some point in the next few days. Hopefully I'm just doing it wrong and can figure it out.

So at this point, I still don't have a way to archive anything behind a login.
Title: Re: Can we start a board mirroring project?
Post by: malonkey1 on January 09, 2013, 04:29:35 PM
Could you link us to what you have?
Title: Re: Can we start a board mirroring project?
Post by: Osborn on January 10, 2013, 08:57:12 PM
Quote from: Sekoia on October 01, 2012, 05:23:06 PM
The problem with the Wayback Machine is that you won't know if they captured content for a very long time. Their most recent capture of the official forums is July 2011, which is over a year ago. Plus a lot of their copies are only partial -- I keep landing on their "liveweb" interface when browsing their archive. So they're good for what they already have, but I don't want to depend on them for being complete and up-to-date.

Not to mention the Wayback Machine won't always mirror actual threads, just the main page. So you can get a month-by-month snapshot of which threads were last updated that day or how many replies a topic got, but that's it.
Title: Re: Can we start a board mirroring project?
Post by: The Fifth Horseman on January 10, 2013, 10:31:30 PM
IIRC they made a complete backup of the forums around November, but it's not part of Wayback Machine.
Title: Re: Can we start a board mirroring project?
Post by: Aggelakis on January 11, 2013, 06:20:08 AM
http://archive.org/details/archiveteam-city-of-heroes-main

From Sept 2012.
Title: Re: Can we start a board mirroring project?
Post by: malonkey1 on January 11, 2013, 07:51:38 PM
Awesome. Thanks.
Title: Re: Can we start a board mirroring project?
Post by: Mister Bison on January 12, 2013, 08:31:02 AM
ouch. 219 GB. And I thought I could just go there and look for a previous post. I was wrong.

Is somebody planning to host it somewhere?
Title: Re: Can we start a board mirroring project?
Post by: The Fifth Horseman on January 12, 2013, 06:55:40 PM
219 gigs? What. That doesn't compute...
Title: Re: Can we start a board mirroring project?
Post by: Mister Bison on January 12, 2013, 11:05:44 PM
Quote from: The Fifth Horseman on January 12, 2013, 06:55:40 PM
219 gigs? What. That doesn't compute...
That's the tarred size. The actual data could be 1 TB.
Title: Re: Can we start a board mirroring project?
Post by: The Fifth Horseman on January 13, 2013, 12:06:32 AM
Damn. You are right.
*strangling noises*
Title: Re: Can we start a board mirroring project?
Post by: Victoria Victrix on January 13, 2013, 02:07:46 AM
Forumites are chatty.  Heavy noise-to-signal ratio too.
Title: Re: Can we start a board mirroring project?
Post by: Quinch on January 13, 2013, 02:20:04 AM
I'm guessing it's the pictures. My own copy is a little under 25 gigs, uncompressed.
Title: Re: Can we start a board mirroring project?
Post by: Mister Bison on January 13, 2013, 10:03:19 AM
Quote from: Quinch on January 13, 2013, 02:20:04 AM
I'm guessing it's the pictures. My own copy is a little under 25 gigs, uncompressed.
The guys from webarchive did say that their in-house scripts were nothing like what you find out there, so I don't know what you used, but it's possible that you backed up far less information than they did. I don't think there is 250 GB of images hosted on anything under cityofheroes.com/, because that's what needed to be backed up. And those would only be profile images and forum style-related images.

-EDIT: wait, I was wrong. The extension is TAR, it's uncompressed (no more than the files inside are) so it's 219GB worth of data, period.
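
The point about plain TAR is easy to check with Python's standard `tarfile` module. The sketch below packs the same files once as plain tar and once as gzip-compressed tar, entirely in memory. The sample pages are hypothetical and deliberately repetitive, so the gzip figure is far better than a real forum dump would achieve:

```python
import io
import tarfile

def tar_sizes(files):
    """Return (payload_bytes, plain_tar_bytes, gzip_tar_bytes) for `files`,
    a dict mapping archive member names to their byte contents."""
    sizes = []
    for mode in ("w", "w:gz"):  # plain tar, then gzip-compressed tar
        buffer = io.BytesIO()
        with tarfile.open(fileobj=buffer, mode=mode) as archive:
            for name, data in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                archive.addfile(info, io.BytesIO(data))
        sizes.append(len(buffer.getvalue()))
    payload = sum(len(data) for data in files.values())
    return payload, sizes[0], sizes[1]

# Hypothetical, highly repetitive sample pages.
sample = {f"thread-{n}.html": b"<p>Save Paragon City!</p>" * 2000 for n in range(5)}
payload, plain, gzipped = tar_sizes(sample)
print(f"payload={payload} plain_tar={plain} gzip_tar={gzipped}")
```

The plain archive comes out slightly larger than the payload because of per-file headers and block padding, so a 219 GB `.tar` holds at most about 219 GB of files.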
Title: Re: Can we start a board mirroring project?
Post by: Quinch on January 13, 2013, 10:06:29 AM
Just the threads, basically - the raw HTML.
Title: Re: Can we start a board mirroring project?
Post by: Arachnion on January 13, 2013, 09:34:36 PM
Quote from: Quinch on January 13, 2013, 10:06:29 AM
Just the threads, basically - the raw HTML.

But.... what about the flavor? the culture? the cat pictures?!

:D