With the Zwillinger news I expect the official boards may go rather quickly. Please act so the board is not GONE TO THE AMERICANS.
I am ridiculously busy and involved in another project I intend to announce soon, but board parsing should not be a particularly tech intensive task if any of you are willing.
Yeah, I'm thinking along the same lines, too: the fewer red names present, the earlier they'll just axe the boards to avoid liability (player makes racist/hate speech comments, they linger, lawyers pirouette, peasants get lulz, and NCSoft closes them anyway).
I believe several people are already working on this - the thread may have dropped to the third page or so, but it's here.
I saw a lot of discussion but no commitments.
I saw links, discussion, and concepts, but no solid confirmations.
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.
I have three archives:
The first archive is everything except the forums. So that would include pretty much anything at *.cityofheroes.com that can be found through recursively linking from the main website, as well as content on a few other related domains. So this includes domains such as ftp.coh.com, goingrogue.na.cityofheroes.com, ftp.ncsoft.com. I last took a snapshot around September 10 but I may take a new snapshot again later this week. This archive is more than 20 GB in size, due to media files.
The second archive is just the forums. My initial attempts to archive this met with some difficulties. Because it's a dynamic site with many ways of viewing the same information, you end up with many, many copies of the same information. And because it's all "flat", having so many files in a single directory was making my computer cry. However, this thread got me investigating again and I discovered that there's an archive-friendly (http://boards.cityofheroes.com/archive/index.php) version of the forums that strips out links, images, formatting, etc. Now that is great for archiving! So last night I kicked off an archive job and it downloaded successfully, though it stopped at, I think, 1,000,000 links (because I didn't realize that limit was a default). I now have it set to download with a much higher threshold. I don't know how long it will take to finish, but I'll try to make sure I get at least one full snapshot. Not sure how big it will end up being, but the partial download was 1 GB.
The third archive is all the videos from their Twitch.TV, Ustream, and YouTube accounts, which I downloaded. This is nearly 60 GB of data.
I'm also keeping the archives I make backed up on SpiderOak (https://spideroak.com/) so I'm reasonably confident I won't lose them due to drive failure or anything. However, as I said, I wouldn't want to discourage anyone from making their own archives; more copies is safer. I'm using HTTrack (http://www.httrack.com/) to make my website archives.
There may be a way to archive vBulletin boards specifically, in a form that could be converted to a database; however, I found none.
You could filter URLs to ignore certain types of categories. For example only capture the indexes, postings, and profiles.
Quote from: Chaos Ex Machina on September 23, 2012, 06:23:30 PM
You could filter URLs to ignore certain types of categories. For example only capture the indexes, postings, and profiles.
That was actually what I was starting to do yesterday when I started, until I noticed the archive version of the site. The archive version gives us all the post contents and forum indexes, so I figure that's the highest priority to capture at least initially. Maybe once I have a solid archive of that, I may try to expand to the "normal" version of the forums to capture images and formatting, though I have a suspicion that'll blow the size of the archive up quite significantly.
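As a rough illustration of the URL filtering idea discussed above, here is a minimal Python sketch. It is not the filter anyone in this thread actually used. The script names in ALLOWED_SCRIPTS are assumptions based on typical vBulletin installs (only forumdisplay.php and archive/index.php appear in this thread), and should_capture is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Host taken from the links posted in this thread.
BOARD_HOST = "boards.cityofheroes.com"

# ASSUMPTION: script names typical of vBulletin. Only forumdisplay.php and
# archive/index.php are confirmed by links in this thread.
ALLOWED_SCRIPTS = {"index.php", "forumdisplay.php", "showthread.php", "member.php"}


def should_capture(url: str) -> bool:
    """Return True only for index, thread, and profile pages on the board host."""
    parts = urlparse(url)
    if parts.netloc.lower() != BOARD_HOST:
        return False
    # The last path segment is the script being requested.
    script = parts.path.rsplit("/", 1)[-1]
    return script in ALLOWED_SCRIPTS


if __name__ == "__main__":
    for candidate in (
        "http://boards.cityofheroes.com/forumdisplay.php?f=726",
        "http://boards.cityofheroes.com/search.php?do=getnew",
        "http://www.cityofheroes.com/index.php",
    ):
        print(candidate, should_capture(candidate))
```

A whitelist like this keeps the crawl away from search, reply, and sorting links, which is where most of the duplicate pages come from.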
Quote from: Sekoia on September 23, 2012, 06:02:07 PM
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.
Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)
Archiving a forum without server access is not something we know how to do in my household. If someone wants to give me a few instructions, I'm happy to be another archiver.
Quote from: Windy on September 24, 2012, 12:52:19 AM
Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)
I don't have any specific plans for how to make them available just yet, but when the official sites go offline I'll work with Titan to see if there's some way we can make them publicly available. I'd definitely like to see that happen.
Quote from: Windy on September 24, 2012, 12:52:19 AM
Archiving a forum without server access is not something we know how to do in my household. If someone wants to give me a few instructions, I'm happy to be another archiver.
Here's the software I used for it: http://www.httrack.com/ I'm currently running an archival job, so I can't check right now to see what settings I used specifically. I'll try to share them later, though. The software isn't too terribly hard to learn, especially if you read through their documentation (http://www.httrack.com/html/index.html).
UPDATE: I managed to finish downloading the archive version of the forums. It totals about 4.5 GB. Note that this also only includes the publicly viewable parts of the forums, not the stuff that requires log-in.
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.
I'm thinking of putting something together that will simply go page by page and download thread by thread. I'm still getting the hang of C#, though, so while I know I can do it, I'm not sure I can do it quickly enough.
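For anyone who wants to try the page-by-page approach, here is a minimal sketch in Python rather than C#. It is only an outline under stated assumptions: the showthread.php?t=...&page=... URL pattern is a guess based on typical vBulletin boards, and the rule "stop when a page repeats or fails to load" is an assumed way to detect the end of a thread. thread_page_url, fetch_http, and download_thread are hypothetical names.

```python
from typing import Callable, Dict, Optional
from urllib.request import urlopen

# ASSUMPTION: typical vBulletin thread URL; not confirmed in this thread.
BASE = "http://boards.cityofheroes.com/showthread.php"


def thread_page_url(thread_id: int, page: int) -> str:
    """Build the URL for one page of one thread."""
    return f"{BASE}?t={thread_id}&page={page}"


def fetch_http(url: str) -> Optional[str]:
    """Fetch a page over HTTP; return None on any network or HTTP error."""
    try:
        with urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def download_thread(
    thread_id: int,
    fetch: Callable[[str], Optional[str]] = fetch_http,
    max_pages: int = 10000,
) -> Dict[int, str]:
    """Walk a thread page by page and return {page number: html}.

    Stops when a page fails to load or is identical to the previous one
    (ASSUMPTION: asking for a page past the end returns the last page again).
    """
    pages: Dict[int, str] = {}
    previous: Optional[str] = None
    for page in range(1, max_pages + 1):
        html = fetch(thread_page_url(thread_id, page))
        if html is None or html == previous:
            break
        pages[page] = html
        previous = html
    return pages
```

Because the crawler only ever requests URLs it builds itself, it never follows the in-page links that cause the explosion. Real pages may differ between requests (timestamps, session tokens), so the equality check would probably need to compare post content instead of raw HTML.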
Quote from: Quinch on October 01, 2012, 07:22:12 AM
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.
Yeah, I gave up my first attempts at mirroring the forums because of exactly that. Perhaps with filters it could be made more reasonable, though.
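One way such a filter could tame the duplication is to reduce every link to a canonical form before deciding whether it has been seen. The Python sketch below shows the idea only. The parameter names in NOISE_PARAMS are assumptions about what a vBulletin board appends to links, and canonical_url and unique_urls are hypothetical helpers.

```python
from typing import Iterable, List
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

# ASSUMPTION: query parameters that change the URL but not the content.
NOISE_PARAMS = {"s", "highlight", "styleid"}


def canonical_url(url: str) -> str:
    """Drop the #fragment and noise parameters, and sort what remains."""
    parts = urlparse(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in NOISE_PARAMS
    )
    return urlunparse(
        (parts.scheme, parts.netloc.lower(), parts.path, "", urlencode(query), "")
    )


def unique_urls(urls: Iterable[str]) -> List[str]:
    """Return canonical URLs in first-seen order, skipping repeats."""
    seen = set()
    result = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

This only merges links that differ by fragment or by the listed parameters. Links that reach the same thread by post ID instead of thread ID would still count as separate pages and would need board-specific handling.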
The archive version of the forums only represents each post exactly once, so the issue you cite fortunately wasn't a problem there. Unfortunately, the archive version is far from perfect. It strips off formatting and drops images.
Quote from: cmgangrel on October 01, 2012, 11:43:42 AM
Which means that the Beta testing boards are not gathered (ie the ones you see if you log in on a VIP account)
This is correct.
What about Closed Beta forums? I still have "subscriptions" to the one or two I was in.
My current archive only contains the publicly-viewable parts of the forums, so that excludes VIP and Beta stuff.
I'm now making an attempt to archive the VIP section (I found out how to have WinHTTrack log in), but unfortunately I don't have access to any Beta forums that might exist so I wouldn't be able to archive them. I'll post again when the mirroring job is complete to note whether it successfully captured the VIP stuff.
I can tell you that http://boards.cityofheroes.com/forumdisplay.php?f=726 leads to the main I-20 Pre-Beta forums, if you can download it
Nope, tells me I don't have permission to access it (even though I'm logged in). :(
Maybe you can ask a former Dev who still has forum log in privileges...?
Paying closer attention and to clarify, the VIP forums I have access to are actually for Issue 24 beta (for some reason the fact that they were both VIP and Beta wasn't clicking for me...):
Issue 24: Resurgence [VIP Beta] Forums
- Issue 24: Resurgence [VIP Beta] Announcements Forum
- Issue 24: Resurgence [VIP Beta] General Discussion Forum
- Issue 24: Resurgence [VIP Beta] Feedback Forum
- Issue 24: Resurgence [VIP Beta] Bug Reports Forum
Are the previous issues (such as I-20) beta forums still actually there? I was under the impression they nuked them after a while.
I honestly don't follow the official forums very much, so sorry if that sounds clueless. :)
Nope, they are there. I can still see all the contents of http://boards.cityofheroes.com/forumdisplay.php?f=734
Maybe parent directories are hidden but not children?
A few months ago I discovered the ascending order of new forums and found an empty forum and posted in it, wreaking some havoc for one night. The thread was excised the next morning and the phantom forum locked.
AFAIK they don't erase the forums - they make them invisible - since they sometimes refer back to the threads.
Interesting! Once I confirm that HTTrack is actually working properly for the log-in sections, I'll send someone a PM and inquire about the beta forums then. Doesn't hurt to ask. :)
http://wayback.archive.org/web/*/http://boards.cityofheroes.com
I think that this site has already archived the forums a few times over the years (going back to 2004, with the latest being quite recent).
The problem with the Wayback machine is that you won't know if they captured content for a very long time. Their most recent capture of the official forums is July 2011, which is over a year ago. Plus a lot of their copies are only partial -- I keep getting on their "liveweb" interface when browsing their archive. So they're good for what they already have, but I don't want to depend on them for being complete and up-to-date.
I KNEW it was too good to be true :(.
Maybe we can ask Zwill to unsecret a lot of the old CB boards?
I was part of Issue 18 closed beta, as well as one of the ones prior, but I can't remember if it was immediately prior or something like Issue 16 or even 15. If anyone has a link to the Issue 18 closed beta to try, I'd be happy to get that one. Maybe we farm out the closed beta forums? Put out a Call?
Just an update -- my attempt to mirror the I24 VIP Beta forums failed. I guess the cookies didn't work right. Of course, the official forums have a tendency to act funny with their cookies anyway, so it's not a big surprise to me. I'll try making another attempt at some point in the next few days, hopefully I'm just doing it wrong and can figure it out.
So at this point, I still don't have a way to archive anything behind a login.
Could you link us to what you have?
Quote from: Sekoia on October 01, 2012, 05:23:06 PM
The problem with the Wayback machine is that you won't know if they captured content for a very long time. Their most recent capture of the official forums is July 2011, which is over a year ago. Plus a lot of their copies are only partial -- I keep getting on their "liveweb" interface when browsing their archive. So they're good for what they already have, but I don't want to depend on them for being complete and up-to-date.
Not to mention the Wayback machine won't always mirror actual threads, just the main page. So you can get a month-by-month snapshot of which threads were last updated that day or how many replies a topic got, but that's it.
IIRC they made a complete backup of the forums around November, but it's not part of Wayback Machine.
http://archive.org/details/archiveteam-city-of-heroes-main
From Sept 2012.
Awesome. Thanks.
ouch. 219 GB. And I thought I could just go there and look for a previous post. I was wrong.
Is somebody planning to host it somewhere?
219 gigs? What. That doesn't compute...
Quote from: The Fifth Horseman on January 12, 2013, 06:55:40 PM
219 gigs? What. That doesn't compute...
That's tar'ed worth. The actual data could be 1 TB.
Damn. You are right.
*strangling noises*
Forumites are chatty. Heavy noise-to-signal ratio too.
I'm guessing it's the pictures. My own copy is a little under 25 gigs, uncompressed.
Quote from: Quinch on January 13, 2013, 02:20:04 AM
I'm guessing it's the pictures. My own copy is a little under 25 gigs, uncompressed.
The guys from webarchive did say that their in-house scripts were nothing like what you find out there, so I don't know what you've used, but it's possible that you have backed up far less information than they did. I don't think there are 250 GB of images hosted on anything under cityofheroes.com/, since that's all that needed to be backed up, and those would only be profile images and forum style-related images.
-EDIT: wait, I was wrong. The extension is TAR, so it's uncompressed (no more than the files inside are), which means it's 219 GB worth of data, period.
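The point about TAR can be checked with a small experiment. The Python sketch below packs made-up data into an in-memory archive, once as plain tar and once as gzip-compressed tar, using the standard tarfile module. The file name and contents are invented for illustration, and archive_size is a hypothetical helper.

```python
import io
import tarfile
from typing import Dict


def archive_size(files: Dict[str, bytes], mode: str) -> int:
    """Pack files into an in-memory archive and return its size in bytes.

    mode is a tarfile write mode: "w" for plain tar, "w:gz" for gzip.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return len(buffer.getvalue())


if __name__ == "__main__":
    # Made-up sample: repetitive HTML, similar in spirit to forum pages.
    sample = {"thread1.html": b"<p>hello forum</p>" * 5000}
    raw = sum(len(data) for data in sample.values())
    print("raw bytes:     ", raw)
    print("plain tar:     ", archive_size(sample, "w"))
    print("gzip tar (.gz):", archive_size(sample, "w:gz"))
```

A plain tar comes out slightly larger than the raw data because of headers and padding, while the gzip variant of repetitive HTML is much smaller. So a .tar of 219 GB holds roughly that much data, unless the files inside are themselves compressed.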
Just the threads, basically - the raw HTML.
Quote from: Quinch on January 13, 2013, 10:06:29 AM
Just the threads, basically - the raw HTML.
But.... what about the flavor? The culture? The cat pictures?! :D