Facebook has been busy lately: not only did it open-source the folly C++ library, it also put together a dedicated C++ event. If you're interested, see the live blog below:
I managed to snatch tickets for the Facebook C++ Conference. I will be live blogging the event all day. Hope you enjoy!
20.49
Summary
Mainstream hardware is becoming permanently parallel, heterogeneous and distributed.
- Moore’s Mine is ending, but we know the answer: Open a new mine.
- As usual, the end of one wave overlaps the beginning of the next.
The long-term free lunch app requires lots of juicy latent parallelism expressed in a form that can be spread across a machine with a variable number of cores of different kinds – big/small/specialized cores, and local and distributed cores.
- The filet mignon of throughput gains is still on the menu, but now it costs extra effort – development, complexity, and testing.
20.12
Hardware is in motion! This talk is tougher to describe, as the presentation relies on many graphics.
20.00
We are observing the end of Moore's Law. We can no longer rely on single-threaded performance increases; we need to go beyond multi-core to cloud-core and heterogeneous cores.
19.52
Trifecta
- Multicore (2005-) e.g. iPad 2, Kindle Fire, etc.
– Initial transition now complete in all mainstream form factors
- Heterogeneous cores (2009-) e.g. compute-capable GPUs.
– 100x and 1000x parallelism already available on home machines for apps that can harness the GPU.
- Elastic compute cloud cores (HaaS, 2010-) e.g. AWS, Azure
– Summer 2011
19.50
We are entering a Jungle! 2011-201x Put a heterogeneous supercomputer cluster on every desk, in every home and in every pocket.
19.47
Welcome to the Jungle by Herb Sutter
19.35
Short break before the next talk by Herb Sutter.
19.30
Summary
Performance and Efficiency
Critical for Technological Advancement
enables Fundamentally New Products and Experiences
Incremental Steps Add Up to Game-Changing Optimizations
19.28
Summary
AtomicHashMap
Atomic Operations are Slow
- Remove Ordering Constraints if Possible
Unique Set of Trade Offs
- Not Right For All Applications
19.18
Reference Implementation
Micro-Sharding
- Fragment Data to Reduce Contention
- Use Spin Locks Since Contention is Low
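To make the idea concrete, a minimal sketch (my own names and structure, not the reference implementation): fragment the key space across many small maps, each guarded by a cheap spin lock that is almost never contended.

```cpp
#include <array>
#include <atomic>
#include <mutex>
#include <unordered_map>

// Hypothetical micro-sharded map: fragment data across many shards
// so concurrent threads rarely touch the same lock.
template <typename K, typename V, size_t NumShards = 256>
class ShardedMap {
  struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock() { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
  };
  struct Shard {
    SpinLock lock;
    std::unordered_map<K, V> map;
  };
  std::array<Shard, NumShards> shards_;

  Shard& shardFor(const K& key) {
    return shards_[std::hash<K>()(key) % NumShards];
  }

 public:
  void set(const K& key, const V& value) {
    Shard& s = shardFor(key);
    std::lock_guard<SpinLock> g(s.lock);  // held very briefly
    s.map[key] = value;
  }
  bool get(const K& key, V& out) {
    Shard& s = shardFor(key);
    std::lock_guard<SpinLock> g(s.lock);
    auto it = s.map.find(key);
    if (it == s.map.end()) return false;
    out = it->second;
    return true;
  }
};
```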
19.13
folly::AtomicHashMap
Benchmarking
- Simple
- Isolate Important Components
19.10
Projecting Hash Values
Hybrid Solution:
- Make Array Optimal Size
- Mask by Next Power of 2
- If Outside the Array, Use Modulus
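A sketch of the hybrid in code; nextPowTwo is a stand-in helper built on a gcc/clang builtin:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in next-power-of-two helper.
inline uint64_t nextPowTwo(uint64_t v) {
  if (v <= 1) return 1;
  return 1ULL << (64 - __builtin_clzll(v - 1));
}

// Hybrid projection: size the array optimally, mask with the next
// power of two, and fall back to the slow modulus only when the
// masked index lands past the end of the array.
inline size_t project(uint64_t hash, size_t capacity, uint64_t mask) {
  size_t idx = hash & mask;        // cheap: a single AND
  if (idx < capacity) return idx;  // common case
  return hash % capacity;          // rare slow path: integer division
}

// Usage:
//   size_t cap = 1250;                    // 1000 entries at 80% load
//   uint64_t mask = nextPowTwo(cap) - 1;  // 2048 - 1
//   size_t slot = project(h, cap, mask);
```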
19.09
Projecting Hash Values
Modulus is Slow
- Integer Division
Masking With Power of 2 Is Wasteful
- e.g. 1000 entries at an 80% load factor needs only 1250 slots, but nextPowTwo(1250) = 2048, wasting nearly 40% of the array
19.06
Approach
Increments Cached Per Thread
- Periodically Flushed Atomically
- Traverse All Threads for Exact Value
New ThreadLocal Class
- Fast
- Provides Iterator Across All Threads
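A single-counter sketch of the approach (not folly's ThreadCachedInt; notably, folly's ThreadLocal gives each counter instance its own per-thread slot plus an iterator over all threads' caches, which this sketch omits):

```cpp
#include <atomic>
#include <cstdint>

// Each thread batches increments locally and only touches the shared
// atomic once every kFlushEvery updates.
class ThreadCachedCounter {
  std::atomic<int64_t> global_{0};
  static constexpr int64_t kFlushEvery = 1000;

 public:
  void increment() {
    // NB: a function-local thread_local is shared by all instances;
    // this is exactly why a real implementation needs a per-instance
    // ThreadLocal slot.
    static thread_local int64_t cached = 0;
    if (++cached == kFlushEvery) {  // fast path does no atomic op
      global_.fetch_add(cached, std::memory_order_relaxed);
      cached = 0;
    }
  }
  // Fast but approximate: unflushed per-thread increments are missing.
  // The exact value requires traversing all threads' caches.
  int64_t readApprox() const {
    return global_.load(std::memory_order_relaxed);
  }
};
```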
19.01
folly::ThreadLocal
Implementation
Store Arrays of Pointers with __thread
Unique Index Per ThreadLocal Instance
Use Mutex on ctor/dtor/iterate
- One per “Tag”
About 4x Faster than boost::thread_specific_ptr
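A greatly simplified sketch of that layout (fixed capacity, no destruction, no cross-thread iteration, and the ctor/dtor/iterate mutex is omitted):

```cpp
#include <atomic>
#include <cstddef>

// One flat array of slots per thread (__thread is the gcc/clang
// extension the talk refers to); each ThreadLocal-style instance
// owns a unique index into every thread's array.
static const size_t kMaxInstances = 1024;
static __thread void* tlsSlots[kMaxInstances];  // zero-initialized TLS

template <typename T>
class SimpleThreadLocal {
  size_t index_;
  static size_t nextIndex() {
    static std::atomic<size_t> counter{0};
    return counter.fetch_add(1);  // unique index per instance; no bounds check
  }

 public:
  SimpleThreadLocal() : index_(nextIndex()) {}
  T* get() {
    void*& slot = tlsSlots[index_];  // one indexed load: very fast
    if (!slot) slot = new T();       // lazy per-thread construction
    return static_cast<T*>(slot);
  }
};
```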
18.58
Tracking Size
Obvious Solution : Atomic Increment
- But wait! 50M inc/sec < 100M insert/sec
- Atomic Increment is a Bottleneck
Try again: ThreadCachedInt
Sharded Atomic Increment
- Requires a Lot of Shards
Thread Local Storage
- Better with New ThreadLocal Class
18.56
Remaining Issues
How to Track Size?
Modulus is Slow
18.54
Erase
1. Do a Find
2. Miss -> Failed Erase
3. Found -> cmpxchg to Empty Sentinel Key
Empty -> Locked -> Valid -> Erased
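In code, step 3 might look like this (a minimal sketch; the sentinel values are my reading of the Empty -> Locked -> Valid -> Erased progression, not folly's actual constants):

```cpp
#include <atomic>
#include <cstdint>

const int64_t kEmptyKey  = 0;   // slot never used
const int64_t kLockedKey = -1;  // insert in progress
const int64_t kErasedKey = -2;  // tombstone; probes must continue past it

struct Cell { std::atomic<int64_t> key{kEmptyKey}; };

// Step 3 above: CAS the found key to the sentinel. If the CAS fails,
// another thread erased (or replaced) the entry first.
bool tryErase(Cell& cell, int64_t key) {
  int64_t expected = key;
  return cell.key.compare_exchange_strong(expected, kErasedKey);
}
```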
18.50
Growth
1. Fill to Max Load Factor
- Can’t Rehash
2. Create Another SubMap
3. Re-Project and Continue Probing
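A sketch of the resulting lookup path, assuming a hypothetical SubMap type whose find returns a pointer or null:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical: each SubMap is a fixed-size atomic hash array that is
// never rehashed. Growth appends a new SubMap; the key is re-projected
// into each one in turn.
template <typename SubMap>
const int64_t* findAcross(
    const std::vector<std::unique_ptr<SubMap>>& subMaps, int64_t key) {
  for (const auto& sm : subMaps) {
    if (const int64_t* v = sm->find(key)) {  // re-project + probe here
      return v;
    }
  }
  return nullptr;  // missed every submap
}
```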
18.43
Atomic Probing Algorithm
Always in a Consistent State
- find() has No Memory Barriers
insert() Uses Keys for Locking
- Very Low Contention
- No Additional Storage
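A minimal sketch of what "keys for locking" can mean, with assumed sentinel values (same caveats as the erase sketch above):

```cpp
#include <atomic>
#include <cstdint>

const int64_t kEmpty  = 0;   // assumed sentinels
const int64_t kLocked = -1;

struct Slot {
  std::atomic<int64_t> key{kEmpty};
  std::atomic<int64_t> value{0};
};

// Claim the slot by CAS'ing its key Empty -> Locked, write the value,
// then publish the real key (Locked -> Valid). Readers never need a
// lock or extra storage: they see Empty, Locked, or a fully valid key.
bool tryInsertAt(Slot& s, int64_t key, int64_t value) {
  int64_t expected = kEmpty;
  if (!s.key.compare_exchange_strong(expected, kLocked)) {
    return false;  // slot already claimed by another inserter
  }
  s.value.store(value, std::memory_order_relaxed);
  s.key.store(key, std::memory_order_release);
  return true;
}
```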
18.41
Probing Algorithm
- Project Hash Onto Array
- Advance Until Equal or Empty
- Wrap Around
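As a sketch (Slot and kEmpty as in the insert sketch above):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Linear probe: project the hash onto the array, advance until the key
// matches or an empty slot proves absence, wrapping at the end.
size_t probe(const Slot* slots, size_t capacity, int64_t key, uint64_t hash) {
  size_t idx = hash % capacity;
  for (size_t n = 0; n < capacity; ++n) {
    int64_t k = slots[idx].key.load(std::memory_order_relaxed);
    if (k == key || k == kEmpty) return idx;    // found, or free slot
    idx = (idx + 1 == capacity) ? 0 : idx + 1;  // advance and wrap
  }
  return capacity;  // table full and key absent
}
```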
18.39
New Approach: AtomicHashMap
folly::AtomicHashMap
Result Preview
- Blistering Speed (1.3x-5x)
- Memory Efficient + Compact References
- std::unordered_map Interface
Except…
- Only Integer Keys
- Sensitive to Initialization Size
- Erased Entries Never Reclaimed
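Going by the bullets above, usage should look roughly like std::unordered_map with a size hint up front. A guess at the shape (header path and exact signatures assumed, not checked against the released code):

```cpp
#include <folly/AtomicHashMap.h>  // header path assumed
#include <cstdint>
#include <utility>

int main() {
  // "Sensitive to initialization size": the expected element count is
  // given up front, because full submaps are never rehashed.
  folly::AtomicHashMap<int64_t, int64_t> counts(100000);

  counts.insert(std::make_pair(int64_t(12345), int64_t(1)));  // int keys only
  auto it = counts.find(12345);
  if (it != counts.end()) {
    ++it->second;
  }
  // Caveat from the slides: erased entries are never reclaimed.
  return 0;
}
```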
18.36
Graph Storage and Access
Possible Approaches
Micro-Sharding
Layered Maps
Intel tbb::concurrent_hash_map
Build Something New
18.36
Graph Storage and Access
Considerations
- CPU Bottleneck
- Memory Sensitivity
- Many Cores Accessing the Same Data
- Updated in Real-Time
18.35
What’s it look like?
Huge trove of data, many users have 1M 2nd degree connections
Data and Access Highly Dynamic, live updates, new products
Bottom line: we need performance
18.33
Facebook: The Graph?
What’s it Good For?
News feed, what’s going on in the action graph.
Search, find people and things
Ads, Groups, etc.
18.32
What should we do about performance? Challenge assumptions, hardware tradeoffs are changing, talk and share code.
18.30
At facebook: incremental changes add up, speed is critical for virality, and efficiency helps the environment -> performance is very important.
18.29
Massively Parallel Hashmaps: Spencer Ahrens
Why do we care about hashmap performance?
18.19
Short break between talks, will be back!
18.14
Folly components:
1. Containers
2. Parsing and formatting
3. Multithreading helpers
4. General utilities
5. More to come
18.13
fbstring will be open-sourced!
folly: Facebook’s Open-Source Library
18.13
Andrei emphasizes that you should learn math; CS people really need to know math.
18.06
fbstring speedup – 2.53 times faster with ILP and loop unrolling
18.02
modulo subtraction is injective
17.55
The ILP version of atoul gives a 193% speedup over the plain atoul.
17.53
associative means parallelizable
17.53
Minimizing data dependencies
“5235” -> 5235! String to integral (ids are numbers stored as strings).
A string atoui loops over each char and converts it to the appropriate digit… see typical atoui implementations.
fbstring actually takes advantage of loop unrolling when converting strings to ints.
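The gist of the unrolling trick, as a sketch: the naive loop carries a serial dependency through the accumulator, while a restructured version converts independent digit groups at the same time (fixed four-digit case, hypothetical helper names):

```cpp
#include <cstddef>
#include <cstdint>

// Naive: every iteration depends on the previous r, so the extra
// ALUs sit idle.
uint64_t atoui_serial(const char* s, size_t len) {
  uint64_t r = 0;
  for (size_t i = 0; i < len; ++i) {
    r = r * 10 + uint64_t(s[i] - '0');  // serial dependency chain
  }
  return r;
}

// ILP-friendly, fixed 4 digits ("5235" -> 5235): the two halves share
// no data, so they can execute simultaneously on different ALUs.
uint64_t atoui4_ilp(const char* s) {
  uint64_t hi = uint64_t(s[0] - '0') * 10 + uint64_t(s[1] - '0');
  uint64_t lo = uint64_t(s[2] - '0') * 10 + uint64_t(s[3] - '0');
  return hi * 100 + lo;
}
```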
17.45
Better instruction level parallelism = fewer data dependencies
17.45
Instruction Level Parallelism
1. Pipelining
2. Superscalar execution
3. OoO execution
4. Register renaming
5. Speculative execution
17.43
The Forgotten Parallelism
People forget about the parallelism inside the CPU that could be taken advantage of. Each core in a facebook server (a Nehalem CPU) has 3 ALUs that go forgotten. You can do three simultaneous arithmetic calculations in a single thread! (Sandy Bridge has more than 3 ALUs.) How do we use them?
17.39
Some Benchmarks
The push_back speedup vs std::string is huge: 400% for 1 push_back call, 500% for 2 calls, and under 200% for 256k calls. The plateau is 1.4 times faster than the standard implementation.
17.34
The French Connection: jemalloc
Jason Evans wrote a great allocator for FreeBSD and it’s used everywhere at facebook.
jemalloc
1. Highly concurrent
2. Cache-friendly
3. Fast
Extended allocator interface
1. Allocate in allocator-sized slabs
2. In-place expansion where possible
fbstring solves fragmentation by knowing what jemalloc wants: the string sizes its internal representation to match what jemalloc will actually allocate.
jemalloc allows reallocation in place; e.g. a large string built up in a loop gets faster, and occasional quadratic algorithms become linear! woot!
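The "knowing what jemalloc wants" point amounts to rounding requests up to the allocator's real size classes, so capacity matches what malloc reserved anyway. A sketch of the idea; the rounding rule below is a made-up stand-in for jemalloc's actual size-class table:

```cpp
#include <cstddef>

// Pretend size classes are powers of two (they aren't exactly, in
// jemalloc; this is purely illustrative). Asking for 100 bytes when
// the allocator would reserve 128 anyway wastes 28 bytes of perfectly
// usable string capacity.
size_t roundToAssumedSizeClass(size_t wanted) {
  size_t c = 16;
  while (c < wanted) c *= 2;
  return c;
}

// A growing string would size its buffer as:
//   newCapacity = roundToAssumedSizeClass(neededBytes);
// so capacity == what the allocator actually handed back, and in-place
// reallocation can often extend the buffer without copying.
```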
17.27
23 >> 15 (23 is the max chars in a fbstring, 15 is what other libraries use for string)
23 vs 15 may not sound like a big deal… actually it is, because facebook has many protocols that store ints as strings, and many of those ints are small, but not all! User ids at facebook are randomly generated 64-bit numbers; almost none of them fit in 15 chars, so they would spill into allocated memory, but with fbstring’s 23 chars none spill, and thus they stay highly optimized.
17.24
fbstring has control: it lets them take full control over where data goes and optimize memory allocation. The performance impact is huge.
Andrei is making fun of Python/Java programmers for not knowing what happens in memory.
17.22
fbstring data layout
Small
“hello, world!\0” | 9
(9 is the number of in-place bytes still unused, which lets them append quickly; when there are no unused bytes left, that last byte reads 0 and doubles as the null terminator)
Medium/Large
char* data | size_t length | size_t capacity | FF
(if the FF flag bits are all ones, the string is in this format; otherwise it is in the small format)
Large
Same as Medium, plus an atomic reference count in the heap buffer (used for copy-on-write).
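A rough model of that layout in code (field order and flag placement are my guesses, not fbstring's exact bit layout):

```cpp
#include <cstddef>

// 24 bytes either way on a 64-bit machine.
union StringRep {
  // Small: chars stored in place. The last byte holds the count of
  // unused bytes; when the 23 chars are full it reads 0 and doubles
  // as the null terminator.
  char small[24];

  // Medium/Large: pointer + size + capacity, with the discriminating
  // flag bits (the "FF" above) packed into the top of the last word.
  struct {
    char* data;
    size_t length;
    size_t capacity;  // top bits flag medium/large vs. small
  } ml;
};
// Large strings additionally keep an atomic reference count in the
// heap buffer, enabling copy-on-write.
```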
17.17
fbstring’s 3-tiered layout
1. In-place store for 0..23 chars
2. Eagerly copied values 24..255 chars
3. Copy-on-write 256 chars and above
17.16
We are live again! Andrei is speaking, the title of the talk is Sheer Folly.
C++, why do we care? Why do we still care about systems languages?
1. Data layout control: how does data sit in memory? C++ lets you say where literally every byte goes. Example: fbstring. Hang on, let’s not reinvent the wheel here… but we did. Demotivation for std::string: (1) std::string as a container is fail; (2) std::allocator is fail; (3) std::char_traits is fail; (4) algorithm integration is fail; (5) copy-on-write became fail. Motivation: (1) because it’s there, which lets us take existing code and link it against fbstring.
2. Cooperation with the allocator
3. Copy on write optimization
16.06
Lunch Break! Following lunch, Andrei Alexandrescu will be making an announcement of interest to C++ programmers around the world. Stay tuned.
16.02
Slides from the talk will become available after the event.
16.00
Summary
- Perfect forwarding fails for some kinds of arguments
- To “specialize” forwarding templates, use helper class templates or std::enable_if
- templates + pImpl => explicitly force instantiations.
More information in the comp.std.c++ discussion initiated 16 January 2010.
15.58
The VC11 compiler rejects the explicit instantiation. MS has confirmed two compiler bugs, but a workaround exists.
The const char array size in the deduced type can also be fixed, but that is left as homework for the attendees.
15.56
All you need to do is look at the g++ error message, copy the missing instantiation into a source file, and that explicitly instantiates the template. You do have to double-check the type deduced by the compiler, because it may not be the same as the type passed into the function. But g++ is so self-aware that the result actually compiles, links, and runs.
15.54
Explicit Instantiation
The templates again:

```cpp
template <typename T> Widget::Widget(T&& param);        // forwarding ctor
template <typename T> void Widget::setName(T&& param);  // forwarding setter
```

The uses again:

```cpp
Widget w("This is a test");
w.setName(std::string("This is another test"));
w.setName(s);
```

Instantiations we need:
- the Widget ctor with T = const char(&)[15]
- setName with T = std::string and also T = std::string&
15.52
Forwarding and the pImpl Idiom. Pimplization is straightforward and compiles without fuss in client code, but the linker disagrees and can’t link! Ha! It’s the usual template instantiation problem: compilers need the template source for instantiation, and pImpl precludes that.
Possible solutions:
- abandon pImpl, i.e. admit defeat (!)
- use a different build chain -> some support link-time instantiation
- perform explicit instantiation
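For the last option, the mechanics might look like this (a sketch; the Widget and the instantiation set follow the slides shown earlier):

```cpp
// widget.h: clients see only declarations; the template body is hidden.
#include <memory>
#include <string>
#include <utility>

class Widget {
 public:
  Widget();
  template <typename T> void setName(T&& name);  // forwarding setter
 private:
  struct Impl;                   // pImpl: defined only in widget.cpp
  std::shared_ptr<Impl> pImpl_;
};

// widget.cpp: Impl and the template body live here.
struct Widget::Impl { std::string name; };
Widget::Widget() : pImpl_(new Impl) {}
template <typename T>
void Widget::setName(T&& name) { pImpl_->name = std::forward<T>(name); }

// Explicit instantiations force the compiler to emit, right here, the
// code that client translation units will link against:
template void Widget::setName<std::string>(std::string&& name);
template void Widget::setName<std::string&>(std::string& name);
```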
15.48
This is the first adventure in perfect forwarding: how do you deal with specializing for different types without messing with the universal reference syntax, T&&, that is forced on you?
15.46
Specializing for shared_ptr does not cover const shared_ptr. To catch both, we can add another partial specialization.
15.45
Random comment from the crowd: I love being able to write >> in the new C++11 without the space. Scott Meyers is envious that someone can be so happy about that.
15.44
We definitely want to have a specialization for shared_ptrs.
15.44
Specializing for pointers is unrealistic: it’s rarely useful to distinguish pointer lvalues from rvalues.
Specializing for UDTs is very realistic.
- Often useful to distinguish lvalues and rvalues
- std::shared_ptr is such a UDT: moving is cheaper than copying.
15.42
Dispatching through a class template enables class template partial specialization, which lets us avoid the enable_if problems (sketch below).
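A minimal sketch of the dispatch technique (names are mine, and the printing is only to show which path runs):

```cpp
#include <iostream>
#include <type_traits>
#include <utility>

// The function being forwarded to (hypothetical, silent here).
template <typename T>
void f(T&&) {}

// Primary template handles the general case...
template <typename T>
struct FwdDispatcher {
  template <typename U>
  static void dispatch(U&& param) {
    std::cout << "general case\n";
    f(std::forward<U>(param));
  }
};

// ...and a partial specialization (which function templates can't
// have, but class templates can) handles pointers.
template <typename T>
struct FwdDispatcher<T*> {
  template <typename U>
  static void dispatch(U&& param) {
    std::cout << "pointer case\n";
    f(std::forward<U>(param));
  }
};

// The single forwarding function strips ref/const, then dispatches.
template <typename T>
void fwd(T&& param) {
  using Raw = typename std::remove_const<
      typename std::remove_reference<T>::type>::type;
  FwdDispatcher<Raw>::dispatch(std::forward<T>(param));
}

int main() {
  int x = 0;
  fwd(x);   // prints "general case"
  fwd(&x);  // prints "pointer case"
}
```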
15.38
Too much code on the slide. There are several drawbacks to enable_if, especially once we start handling expressions such as const char*.
15.37
One solution: overload via std::enable_if
The code below is the generalization:

```cpp
template <typename T>
typename std::enable_if<
    !std::is_pointer<typename std::remove_reference<T>::type>::value
>::type
fwd(T&& param) {
    f(std::forward<T>(param));
}

template <typename T>
typename std::enable_if<
    std::is_pointer<typename std::remove_reference<T>::type>::value
>::type
fwd(T&& param) {
    f(std::forward<T>(param));
}
```
15.28
!!!
Function templates can’t be partially specialized, only overloaded. But the form of forwarding templates is constrained; the only parameter type that works is T&&! (Works = handles all combos of const/non-const expressions.)
15.27
“Specializing” forwarding templates
Forwarding to f is easy:

```cpp
template <typename T>
void fwd(T&& param) {
    f(std::forward<T>(param));
}
```
but what if we want to specialize our template for pointers?
15.26
Several kinds of arguments can’t be perfect-forwarded.
* 0 as null pointer constant
* braced initializer lists
* integral consts…
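A small example of the first two failure cases (f and fwd are hypothetical):

```cpp
#include <utility>
#include <vector>

void f(const std::vector<int>&) {}
void f(const char*) {}

template <typename T>
void fwd(T&& param) { f(std::forward<T>(param)); }

int main() {
  f({1, 2, 3});      // OK: the braced list initializes a vector<int>
  f(0);              // OK: literal 0 is a null pointer constant
  // fwd({1, 2, 3}); // error: T can't be deduced from a braced list
  // fwd(0);         // error: T deduces as int, and no f(int) exists;
  //                 //        0-as-null-pointer does not survive forwarding
  return 0;
}
```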
15.25
std::make_shared is a function that returns a shared pointer; it is typically more efficient than creating your object and then wrapping it in a shared pointer.
std::make_shared:

```cpp
std::string s("PF example");
auto pw = std::make_shared<Widget>(s, std::vector<int>({1, 2, 3}));

// in g++:
template <typename T, typename... ParamTs>
shared_ptr<T> make_shared(/* ... */);
// the call goes through several layers, so perfect forwarding
// becomes extremely desirable.
```
15.17
Perfect forwarding is our main interest. Perfect forwarding lets you pass params straight through templates. The common use case is constructors in templated classes. This is where we have problems with the new move constructs; perfect forwarding is difficult because we have to be careful with rvalues and rvalue references.
15.13
In order to support moves, C++11 has a new construct called rvalue references. The new syntax type&& gives you an rvalue reference. Rvalue references identify candidates for moving versus copying.
Must be careful: in function templates, type&& is not really an rvalue reference. It means “bind to anything” -> a universal reference.
Example:

```cpp
template <typename T>
void f(T&& param);

int x;
f(x);          // lvalue: T is int&, instantiates f(int&)
f(factory());  // rvalue: T is std::shared_ptr<Widget>,
               // instantiates f(std::shared_ptr<Widget>&&)
```
15.07
Scott will be speaking about C++11. Lvalues are generally expressions whose address you can take, and they are important for copy requests. In C++11, copy requests can often become move requests; the language now takes more care about the difference between lvalues and rvalues.
15.05
C++ is seeing new growth because we can’t get much faster processors anymore. Scott will now give his talk.
15.04
Opening remarks starting. Andrei Alexandrescu is our MC, hosting the first #fb #cpp #conf. There is great demand for C++ knowledge, and facebook is very excited about this new growth in interest.
14.46
Slides are coming up: Adventures in Perfect Forwarding by Scott Meyers. Opening remarks are expected to be on time. #fb #cpp #conf
14.31
Blinds are going down here at the fb cpp conf. Stage is being prepared. 30 minutes to go before the opening remarks. Scott Meyers’s talk will follow.
14.13
48 minutes to go before opening remarks, the conference hall is slowly filling up.
13.59
The Facebook conference registration will be starting at 11. They’ve got sweet nametags:

```cpp
[&facebook]() {
    return std::move(fast) & break_things();
}
```