Well-Known Site Architectures Revealed: Flickr

Flickr

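Platform
Request dispatch: Squid as a reverse proxy for HTML and images.
Page technology: PHP (Perl)
                    Smarty for templating
Web server: Apache
Cache: Memcached.
Logic components and tools: ImageMagick for image processing
          PEAR for XML and email parsing
          Java for the node service
Database: MySQL (sharded, master-master)
Operating system: Linux (RedHat)
Deployment: SystemImager for automated deployment.
                    Subcon stores key system configuration files in SVN to simplify cluster deployment.
                    Cvsup distributes and updates collections of files across the network.
Monitoring: Ganglia for distributed system monitoring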

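Architecture
A central database holds the user table, which includes each user's primary key and an index value used to locate the shard that stores that user's data.
Dedicated servers hold static content.
Shared-nothing storage architecture.
Redundancy is used to scale availability, but only for reads.
A search farm is built by replicating the portion of the data that needs to be searchable.
Scaling is horizontal, so adding capacity only means adding machines.
An earlier master-slave setup caused problems: load was too high and there was a single point of failure.
The shard identifier is a random number generated for each new user.
Data is migrated between shards continually, and the migration is done manually. Moving 192,000 photos and 700,000 tags takes about 3 to 4 minutes.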
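
The lookup described above (a central table that maps each user to a shard, a random shard chosen for new users, and manual migration) can be sketched roughly as follows. This is a minimal illustration, not Flickr's code: the class `ShardDirectory`, its method names, and the in-memory dictionary standing in for the central database are all assumptions.

```python
import random


class ShardDirectory:
    """Hypothetical stand-in for the central user table (user id -> shard id)."""

    def __init__(self, shard_ids, rng=None):
        self.shard_ids = list(shard_ids)
        self.rng = rng or random.Random()
        self.user_to_shard = {}

    def register(self, user_id):
        """Assign a new user to a randomly chosen shard and record the choice."""
        if user_id in self.user_to_shard:
            raise ValueError(f"user {user_id} is already registered")
        shard_id = self.rng.choice(self.shard_ids)
        self.user_to_shard[user_id] = shard_id
        return shard_id

    def lookup(self, user_id):
        """Return the shard that holds this user's data."""
        return self.user_to_shard[user_id]

    def migrate(self, user_id, new_shard_id):
        """Repoint a user once their rows have been copied to another shard."""
        if new_shard_id not in self.shard_ids:
            raise ValueError(f"unknown shard {new_shard_id}")
        self.user_to_shard[user_id] = new_shard_id
```

Because every request resolves the shard through the directory, moving a user only requires copying their rows and updating one entry, which is presumably what makes ongoing manual rebalancing practical.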

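Clicking Favorite
Fetch the photo owner's account from the cache to find the owner's shard.
Fetch your own account information from the cache to find your own shard.
Start a "distributed transaction" to answer two questions: who has favorited this photo, and which photos have I favorited?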

A question can be asked from any shard and the data recovered. It’s absolutely redundant.
To get rid of replication lag:
- on every page load, the user is assigned to a bucket
- if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don’t use persistent connections; they build connections and tear them down. Every page load thus tests the connection.
Every user’s reads and writes are kept in one shard. The notion of replication lag is gone.
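
The bucket-and-failover behaviour described above might look roughly like this. The notes do not say how a user is mapped to a bucket, so the CRC32 hash is an assumption, as are the names `pick_host`, `AllHostsDown`, and the `connect` callable, which is expected to raise `ConnectionError` for an unreachable host.

```python
import zlib


class AllHostsDown(Exception):
    """Raised when no host in the list accepts a connection (show an error page)."""


def pick_host(user_id, hosts, connect):
    """Return (host, connection) for the first reachable host.

    The search starts at the user's bucket and walks the list in order,
    opening a fresh connection on every call instead of reusing one.
    """
    if not hosts:
        raise AllHostsDown("no hosts configured")
    # Assumption: a stable hash of the user id selects the starting bucket.
    bucket = zlib.crc32(str(user_id).encode()) % len(hosts)
    for offset in range(len(hosts)):
        host = hosts[(bucket + offset) % len(hosts)]
        try:
            return host, connect(host)
        except ConnectionError:
            continue  # host is down, try the next one in the list
    raise AllHostsDown(f"all {len(hosts)} hosts are down")
```

Since a new connection is opened on each call, a dead host is detected on the very next page load rather than through a separate health check.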
Each server in a shard is 50% loaded, so half the servers in each shard can be shut down: one server can take the full load if the other server of that shard is down or in maintenance mode. To upgrade, you shut down half the shard, upgrade that half, and then repeat the process with the other half.
During traffic spikes they break the 50% rule, though, and do something like 6,000-7,000 queries per second. The design is for at most 4,000 queries per second to keep load at 50%.
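
As a rough worked example of the 50% rule, the figures above can be turned into load percentages. This assumes load scales linearly with query rate and that the 4,000 and 6,000-7,000 figures refer to the same unit of capacity; the function name `load_fraction` is made up for illustration.

```python
def load_fraction(qps, design_qps=4000, design_load=0.5):
    """Load implied by a query rate, scaled linearly from the design point."""
    full_capacity = design_qps / design_load  # 8,000 qps would be 100% load
    return qps / full_capacity


for qps in (4000, 6000, 7000):
    print(f"{qps} qps -> {load_fraction(qps):.1%} load")
# 4000 qps -> 50.0% load
# 6000 qps -> 75.0% load
# 7000 qps -> 87.5% load
```

Under these assumptions, a spike pushes load to 75-87.5%, so during a spike a single surviving server could no longer absorb its partner's share.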
The average page runs 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. They achieved the real-time requirements without any disadvantages.
Over 36,000 queries per second, running within the capacity threshold. Bursts of traffic can double that 36K qps.
Each shard holds the data of 400K+ users.
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? How about both places? Transactions are used to prevent out-of-sync data: open transaction 1, write commands, open transaction 2, write commands, commit the 1st transaction if all is well, commit the 2nd transaction if the 1st committed. But there is still a chance of failure when a box goes down during the 1st commit.
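
The two-transaction write described above can be sketched with two in-memory SQLite databases standing in for the two shards. SQLite, the `comments` table layout, and the function names are assumptions for illustration (the real system used MySQL), and the sketch assumes the two shards are distinct connections.

```python
import sqlite3


def make_shard():
    """Create a stand-in shard; isolation_level=None lets us issue BEGIN/COMMIT ourselves."""
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE comments (photo_owner INTEGER, author INTEGER, body TEXT)")
    return db


def write_comment(owner_shard, author_shard, owner_id, author_id, body):
    """Store the same comment on both shards using two coordinated transactions."""
    row = (owner_id, author_id, body)
    opened = []
    try:
        for shard in (owner_shard, author_shard):
            shard.execute("BEGIN")
            opened.append(shard)
            shard.execute("INSERT INTO comments VALUES (?, ?, ?)", row)
    except sqlite3.Error:
        # A write failed: undo whatever was opened so neither copy is kept.
        for shard in opened:
            shard.execute("ROLLBACK")
        raise
    # Both writes succeeded: commit the first, then the second.
    owner_shard.execute("COMMIT")
    # A crash at this point would leave the two copies out of sync,
    # which is the residual risk the notes mention.
    author_shard.execute("COMMIT")
```

This is not a true two-phase commit. It only narrows the window in which the copies can diverge, which matches the caveat in the notes.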
Search:
- Two search back-ends: the shards (35K qps on a few shards) and Yahoo!’s (proprietary) web search.
- An owner’s single-tag search or a batch tag change (say, via Organizr) goes to the shards because of real-time requirements; everything else goes to Yahoo!’s engine (probably about 90% behind the real-time goodness).
- Think of it as having a Lucene-like search.
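
The routing rule in the search notes above could be expressed roughly like this. The `TagQuery` fields and the back-end labels `"shards"` and `"yahoo"` are hypothetical names chosen for the sketch, not part of any real API.

```python
from dataclasses import dataclass


@dataclass
class TagQuery:
    requester_id: int           # who is asking
    owner_id: int               # whose photos are being searched
    tags: tuple                 # tags in the query
    batch_change: bool = False  # True for a bulk tag edit


def choose_backend(query):
    """Send real-time-sensitive requests to the shards, the rest to the search engine."""
    owner_single_tag = (
        query.requester_id == query.owner_id and len(query.tags) == 1
    )
    if query.batch_change or owner_single_tag:
        return "shards"
    return "yahoo"
```

For example, `choose_backend(TagQuery(1, 1, ("sunset",)))` returns `"shards"`, while the same search issued by another user returns `"yahoo"`.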
Hardware:
- EM64T boxes running RHEL4, 16GB RAM.
- 6-disk 15K RPM RAID-10.
- Data size is 12 TB of user metadata (these are not photos, just InnoDB ibdata files; the photos are a lot larger).
- 2U boxes. Each shard has ~120GB of data.
Backup procedure:
- ibbackup runs from a cron job across the various shards at different times. Hot backup goes to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Keeping staggered backups is good for when you discover that something went wrong a few days earlier. Something like 1, 2, 10 and 30 day backups.
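
One way to read the staggered 1, 2, 10 and 30 day scheme above is as a retention rule: for each target age, keep the newest backup that is at least that old. The sketch below follows that reading, which is an assumption, and the names `KEEP_AGES_DAYS` and `backups_to_keep` are invented for illustration.

```python
from datetime import date, timedelta

KEEP_AGES_DAYS = (1, 2, 10, 30)  # target ages taken from the notes


def backups_to_keep(backup_dates, today, ages=KEEP_AGES_DAYS):
    """For each target age, keep the newest backup that is at least that old."""
    keep = set()
    for age in ages:
        cutoff = today - timedelta(days=age)
        candidates = [d for d in backup_dates if d <= cutoff]
        if candidates:
            keep.add(max(candidates))
    return keep
```

With nightly backups this keeps exactly four files, and anything not in the returned set is a candidate for deletion.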
Photos are stored on the filer. Upon upload, the photo is processed into different sizes, and then the upload is complete. Metadata and pointers to the filers are stored in the database.
Aggregating the data: very fast, because it’s a process per shard. Stick it into a table, or recover data from another copy on other users’ shards.
max_connections = 400 connections per shard, or 800 connections per server & shard. Plenty of capacity and connections. Thread cache is set to 45, because no more than 45 users have simultaneous activity.
Tags:
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters, which save the results into MySQL, because some relationships are so complicated to calculate that doing it online would absorb all the database CPU cycles.
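
To illustrate the denormalization point above, the sketch below keeps a separate per-tag counter table that is updated on every write, so the tag cloud is a small ordered read instead of an aggregate over every tag row. SQLite and the table names `photo_tags` and `tag_counts` are assumptions for the example, not the actual schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE photo_tags (photo_id INTEGER, tag TEXT, PRIMARY KEY (photo_id, tag));
CREATE TABLE tag_counts (tag TEXT PRIMARY KEY, uses INTEGER NOT NULL);
""")


def add_tag(db, photo_id, tag):
    """Record a tag and bump its denormalized counter in one transaction."""
    exists = db.execute(
        "SELECT 1 FROM photo_tags WHERE photo_id = ? AND tag = ?", (photo_id, tag)
    ).fetchone()
    if exists:
        return  # already tagged, the counter must not change
    with db:  # commits on success, rolls back on error
        db.execute("INSERT INTO photo_tags VALUES (?, ?)", (photo_id, tag))
        updated = db.execute(
            "UPDATE tag_counts SET uses = uses + 1 WHERE tag = ?", (tag,)
        )
        if updated.rowcount == 0:
            db.execute("INSERT INTO tag_counts VALUES (?, 1)", (tag,))


def tag_cloud(db, limit=50):
    """Read the most used tags straight from the counter table."""
    return db.execute(
        "SELECT tag, uses FROM tag_counts ORDER BY uses DESC, tag LIMIT ?", (limit,)
    ).fetchall()
```

The trade-off is extra work on every write and a second copy of the data to keep consistent, in exchange for a read that never scans the full tag table.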
Future Direction:
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc.) at the same time. Everything is active; nothing will ever be idle.

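Statistics
More than 4 billion queries per day
About 35M photos cached in Squid
About 2M photos cached in Squid's RAM
About 470M photos, each in 4 or 5 sizes
38K requests per second to memcached (about 12M objects cached)
2 PB of raw storage, with about 400,000 photos added per day.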
