Brad Fitzpatrick (bradfitz) wrote in lj_backend,
@ 2004-10-18 01:14:00
MogileFS transition
As of tonight, all userpics, phoneposts, and captchas are now stored on our MogileFS file storage system.

Our old system, while well-intentioned, was pretty cheesy and lame technically. It was never meant to be used for long ... it was mostly just a crutch until we figured out what we really wanted to do.

Here's a snapshot of our MogileFS installation at present. We have 6.14 TB free. And if that's not enough, we have 10 machines on hand that could store 1 TB each if we run out of room. We'd just need to throw 4 hot-swap SATA disks in each of them.
lj@grimace:~$ mogcheck.pl
Checking mogilefsd availability...
        10.0.0.81:7001 ... responding.
        10.0.0.82:7001 ... responding.

Device information...
  hostname     device   age    size(G)       used       free    use%  delay
      sto1       dev1   56s    224.319     15.022    209.297   6.70% 0.004s
      sto1       dev2   56s    229.161      9.337    219.823   4.07% 0.004s
      sto1       dev3   56s    229.161      9.273    219.888   4.05% 0.005s
      sto1       dev4   56s    229.161      9.308    219.853   4.06% 0.004s
      sto1       dev5   56s    229.161      9.271    219.890   4.05% 0.013s
      sto1       dev6   56s    229.161      9.409    219.752   4.11% 0.009s
      sto1       dev7   56s    229.161      9.305    219.856   4.06% 0.005s
      sto1       dev8   56s    229.161      9.342    219.819   4.08% 0.004s
      sto1       dev9   56s    229.161      9.298    219.862   4.06% 0.007s
      sto1      dev10   56s    229.161      9.245    219.916   4.03% 0.008s
      sto1      dev11   56s    229.161      9.334    219.826   4.07% 0.004s
      sto1      dev12   56s    229.161      9.281    219.879   4.05% 0.005s
      sto1      dev13   56s    229.161      9.364    219.797   4.09% 0.006s
      sto1      dev14   56s    229.161      9.295    219.865   4.06% 0.008s
      sto2      dev15   10s    224.319      9.342    214.977   4.16% 0.004s
      sto2      dev16   10s    229.161      9.317    219.843   4.07% 0.006s
      sto2      dev17   10s    229.161      9.394    219.767   4.10% 0.005s
      sto2      dev18   10s    229.161      9.387    219.774   4.10% 0.005s
      sto2      dev19   10s    229.161      9.236    219.925   4.03% 0.004s
      sto2      dev20   10s    229.161      9.312    219.849   4.06% 0.006s
      sto2      dev21   10s    229.161      9.211    219.949   4.02% 0.005s
      sto2      dev22   10s    229.161      9.312    219.849   4.06% 0.010s
      sto2      dev23   10s    229.161      9.231    219.930   4.03% 0.004s
      sto2      dev24   10s    229.161      9.370    219.791   4.09% 0.006s
      sto2      dev25   10s    229.161      9.305    219.856   4.06% 0.008s
      sto2      dev26   10s    229.161      9.243    219.917   4.03% 0.013s
      sto2      dev27   10s    229.161      9.264    219.896   4.04% 0.009s
      sto2      dev28   10s    229.161      9.326    219.834   4.07% 0.004s
                total         6406.817    266.336   6140.481   4.16% 0.173s
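
As a sanity check on the totals row, here's a small Python sketch that re-adds the per-device columns copied from the output above. It is not part of mogcheck.pl; the function and variable names are made up for illustration, and it assumes the columns are whitespace-separated in the order shown in the header row.

```python
# Device rows copied verbatim from the mogcheck.pl output above.
DEVICE_ROWS = """
      sto1       dev1   56s    224.319     15.022    209.297   6.70% 0.004s
      sto1       dev2   56s    229.161      9.337    219.823   4.07% 0.004s
      sto1       dev3   56s    229.161      9.273    219.888   4.05% 0.005s
      sto1       dev4   56s    229.161      9.308    219.853   4.06% 0.004s
      sto1       dev5   56s    229.161      9.271    219.890   4.05% 0.013s
      sto1       dev6   56s    229.161      9.409    219.752   4.11% 0.009s
      sto1       dev7   56s    229.161      9.305    219.856   4.06% 0.005s
      sto1       dev8   56s    229.161      9.342    219.819   4.08% 0.004s
      sto1       dev9   56s    229.161      9.298    219.862   4.06% 0.007s
      sto1      dev10   56s    229.161      9.245    219.916   4.03% 0.008s
      sto1      dev11   56s    229.161      9.334    219.826   4.07% 0.004s
      sto1      dev12   56s    229.161      9.281    219.879   4.05% 0.005s
      sto1      dev13   56s    229.161      9.364    219.797   4.09% 0.006s
      sto1      dev14   56s    229.161      9.295    219.865   4.06% 0.008s
      sto2      dev15   10s    224.319      9.342    214.977   4.16% 0.004s
      sto2      dev16   10s    229.161      9.317    219.843   4.07% 0.006s
      sto2      dev17   10s    229.161      9.394    219.767   4.10% 0.005s
      sto2      dev18   10s    229.161      9.387    219.774   4.10% 0.005s
      sto2      dev19   10s    229.161      9.236    219.925   4.03% 0.004s
      sto2      dev20   10s    229.161      9.312    219.849   4.06% 0.006s
      sto2      dev21   10s    229.161      9.211    219.949   4.02% 0.005s
      sto2      dev22   10s    229.161      9.312    219.849   4.06% 0.010s
      sto2      dev23   10s    229.161      9.231    219.930   4.03% 0.004s
      sto2      dev24   10s    229.161      9.370    219.791   4.09% 0.006s
      sto2      dev25   10s    229.161      9.305    219.856   4.06% 0.008s
      sto2      dev26   10s    229.161      9.243    219.917   4.03% 0.013s
      sto2      dev27   10s    229.161      9.264    219.896   4.04% 0.009s
      sto2      dev28   10s    229.161      9.326    219.834   4.07% 0.004s
"""

# Totals row as printed by mogcheck.pl: size, used, free (all in G).
REPORTED = {"size": 6406.817, "used": 266.336, "free": 6140.481}


def sum_columns(rows_text):
    """Count the device rows and sum their size/used/free columns."""
    totals = {"size": 0.0, "used": 0.0, "free": 0.0}
    devices = 0
    for line in rows_text.strip().splitlines():
        # Column order assumed from the header row of the output.
        host, dev, age, size, used, free, pct, delay = line.split()
        totals["size"] += float(size)
        totals["used"] += float(used)
        totals["free"] += float(free)
        devices += 1
    return devices, totals


devices, totals = sum_columns(DEVICE_ROWS)
print(f"{devices} devices")
for key in ("size", "used", "free"):
    print(f"{key:>5}: summed {totals[key]:9.3f} G, reported {REPORTED[key]:9.3f} G")
print(f"free space: {totals['free'] / 1000:.2f} TB")
```

The sums land within about 0.01 G of the printed totals, which is what you'd expect from per-row rounding, and the free column works out to the 6.14 TB quoted above (counting 1000 G per TB).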

The first two lines of that output are checking on the mogilefsd trackers... they're the servers that keep track of where all the files are. They're actually just protocol translators in front of the same MySQL database. And if that database goes down? Well, then we'd be screwed. That's why the database is currently on really nice hardware. But the real plan going forward is to use MySQL Cluster, which we'll be using for our global master DB as well. Then there'd be no single point of failure at all.
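
For anyone who wants a similar availability check of their own, here's a rough Python sketch. It only tests whether a TCP connection to each tracker address can be opened; it doesn't speak the tracker protocol, and it isn't necessarily how mogcheck.pl does its check. The addresses are the ones from the output above; the function names and the timeout are assumptions.

```python
import socket

# Tracker addresses as listed in the mogcheck.pl output above.
TRACKERS = [("10.0.0.81", 7001), ("10.0.0.82", 7001)]


def tracker_responding(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_trackers(trackers, timeout=2.0):
    """Print one status line per tracker and return the results as a dict."""
    results = {}
    for host, port in trackers:
        ok = tracker_responding(host, port, timeout)
        results[(host, port)] = ok
        status = "responding" if ok else "NOT responding"
        print(f"        {host}:{port} ... {status}.")
    return results
```

Calling check_trackers(TRACKERS) from a machine on the same LAN would print one line per tracker. A successful connect only shows that something is listening on the port, not that the MySQL database behind the tracker is healthy.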

Oh, and the MogileFS info shown above is for all of livejournal.com, pics.livejournal.com, and picpix.com. When you make your MogileFS client object, you just specify which domain you're using. For instance, "danga.com::fb" (for fotobilder) or "danga.com::lj" (livejournal). Then identically named files in different namespaces don't conflict.
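
To make the namespace idea concrete, here's a toy Python model. It is not the real MogileFS client API: the classes, methods, and the "pic:1" key are all invented for illustration. The only thing it shows is that when every lookup is keyed by the pair (domain, key), the same key can exist in two domains without colliding.

```python
class ToyTracker:
    """In-memory stand-in for a (domain, key) -> file mapping."""

    def __init__(self):
        self._files = {}

    def client(self, domain):
        """Hand out a client bound to a single domain."""
        return ToyClient(self, domain)


class ToyClient:
    """Hypothetical client object; the domain is fixed at construction."""

    def __init__(self, tracker, domain):
        self._tracker = tracker
        self.domain = domain

    def store(self, key, data):
        self._tracker._files[(self.domain, key)] = data

    def fetch(self, key):
        return self._tracker._files[(self.domain, key)]


tracker = ToyTracker()
fb = tracker.client("danga.com::fb")  # fotobilder
lj = tracker.client("danga.com::lj")  # livejournal

# Same key in both domains; the key name is made up for this example.
fb.store("pic:1", b"fotobilder bytes")
lj.store("pic:1", b"livejournal bytes")

print(fb.fetch("pic:1"))
print(lj.fetch("pic:1"))
```

Each client gets back its own bytes even though both used the key "pic:1".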

If anybody's interested in using MogileFS, we'd love to help you set it up. Join the list and ask away.


kloostec
2004-10-18 09:27 am UTC
So, let me get this straight...

If a disk dies, you just pop out the disk and put in a new one, partition it, and add it back into the tracker? If a server dies, all critical data is redundant (based on class)? I guess then your load would go up a bit while the lost unimportant data was regenerated (I'm assuming this would be things like photo thumbnails)? You could just add a new server with a bunch of disks and everything would be replicated over to restore redundancy?

I think I've got everything above correct (according to the principles behind what you're doing). If so, that's a pretty intense system you've got going there. None of the projects I'm doing requires nearly that amount of space, but next time I write a livejournal, I'll be sure to take a closer look :P

By the way... what SATA controller and drives are you using?

bradfitz
2004-10-18 04:11 pm UTC
3ware-9xxx controllers with 16 SATA disks.

A class is just an attribute of a file which specifies its minimum replica count.

Yeah, a disk dies and all files which were stored there are re-replicated elsewhere from their surviving copies.

There's no concept of a host dying... just being "down for maintenance". But if it really does catch on fire and destroy all the disks, yeah, we can mark all the devices on that host as dead.
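
Here's a toy Python sketch of the class idea: each file carries a minimum replica count, and once some devices are marked dead, any file with fewer live copies than that minimum needs new replicas. This is a simplified model and not MogileFS's actual code or schema; the names FileEntry, min_replicas, and under_replicated, the keys, and the device assignments are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    key: str
    min_replicas: int  # minimum replica count taken from the file's class
    devices: set       # ids of the devices currently holding a copy


def under_replicated(files, dead_devices):
    """Return (key, copies_missing) for files below their class minimum."""
    needy = []
    for f in files:
        live = f.devices - dead_devices
        if len(live) < f.min_replicas:
            needy.append((f.key, f.min_replicas - len(live)))
    return needy


# Made-up files; device ids follow the dev1..dev28 numbering in the post.
files = [
    FileEntry("userpic:1", 2, {1, 15}),
    FileEntry("thumb:1", 1, {3}),
    FileEntry("captcha:9", 2, {4, 20}),
]

# One disk dies: only the file that had a copy on dev1 needs a new replica.
print(under_replicated(files, {1}))

# A whole host burns down: mark dev1..dev14 (everything on sto1) as dead.
print(under_replicated(files, set(range(1, 15))))
```

In the second case the single-copy file has no live replica left at all, so it would have to be regenerated rather than copied.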

drstein
2004-10-18 08:32 pm UTC
I figured that you'd be using 3ware cards..

but what the hell kind of *cases* did you find to hold all of those disks? :P

I think it's time for some photos of the LiveJournal server farm. Us geeks are curious!

cuban321
2004-10-19 12:51 am UTC
I second the photos.

chrisbolt
2004-12-04 03:15 pm UTC
These aren't necessarily what livejournal uses, but...

http://www.supermicro.com/products/chassis/3U/933/SC933T-R760.cfm

cuban321
2004-10-18 11:14 am UTC
Woah, this is sick.......

pne
2004-10-18 11:43 am UTC
Why is something like this better than using a content distribution network such as Akamai or Speedera, which is what I gather you're replacing with this?

Wouldn't it be advantageous to the clients, not to mention better for your bandwidth, to have content served from hosts distributed world-wide and not all sitting in one LAN in a server farm?

mart
2004-10-18 12:17 pm UTC
This is just for distribution of the storage at LiveJournal. There's no reason why Akamai can't then hit, for example, the userpic URL on LJ and cache the picture. MogileFS is replacing the blob storage and blobserver, not Akamai.

LiveJournal and Akamai
pne
2004-10-18 01:06 pm UTC
"MogileFS is replacing the blob storage and blobserver, not Akamai."

Ah; perhaps I drew the wrong conclusion from "We're beginning our move away from Akamai"—I assumed this meant "We will serve content ourselves rather than through Akamai" though I suppose it could simply mean "We will use content provider X instead of Akamai".

bradfitz
2004-10-18 04:12 pm UTC
As Mart said, the move away from Akamai and the move to MogileFS are independent, but related.

We've been meaning to move away from Akamai for a while now. We needed to get everything on MogileFS first because it performs a lot better than our old storage system.

agreg
2004-10-18 05:42 pm UTC
How come you've decided to move away from Akamai? Since you're doing so many hits for static stuff, it would surely make sense to just give that to a CDN like Akamai rather than do it yourself?

What's the cost like compared to having Akamai do all your static stuff, and doing it yourself? IIRC you're already pushing 100Mbps+ out to the internet :)

bradfitz
2004-10-19 03:55 am UTC
Cheaper to do it yourself nowadays. At least for us, but for most people, actually. Bandwidth is just so cheap, servers are so fast and operating systems and webservers are so capable.... Akamai's "value" is draining quick.

bifrosty2k
2004-12-02 02:50 am UTC
Akamai costs about 10x what buying straight out BW costs.
You can buy good quality BW (@100Mbps) for $40/meg pretty easily these days.

If you have the bling, Akamai is great at what it does, but you pay the price for it. Same thing with Internap, except you get even less for the money. Internap BW is worth 50% of what they sell it for; you can find a pretty reasonable number of providers who are cheap and good, and not pay out the wazoo for it.
