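A good answer from Hacker News. Adopt different habits for different situations:
At a very small company (or on an experimental project):
More code is written than read, so optimize for writing.
At a more mature company, code is read more often than it is written, so optimize for reading.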
This is a consequence of the reader:writer ratio - startup code is written a lot more than it is read and so readability matters little, but mature code is read much more than it is written. (Switching to the latter culture when you need to develop like the former to get users & funding & stay alive is left as an exercise for the reader.)
Meta-habit: learn to adopt different habits for different situations. With that in mind, some techniques I've found useful for various situations:
"Researchey" green-field development for data-science-like problems:
- If it can be done manually first, do it manually. You'll gain an intuition for how you might approach it.
- Collect examples. Start with a spreadsheet of data that highlights the data you have available.
- Make it work for one case before you make it work for all cases.
- Build debugging output into your algorithm itself. You should be able to dump the intermediate results of each step and inspect them manually with a text editor or web browser.
- Don't bother with unit tests - they're useless until you can define what correct behavior is, and when you're doing this sort of programming, by definition you can't.
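The "build debugging output into your algorithm" advice above can be sketched as follows. This is only an illustration under assumptions: the three-step pipeline, the `run_pipeline` name, the 0.5 threshold, and the JSON file layout are all hypothetical, chosen so that each intermediate result lands in a file you can open in a text editor.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(raw_rows, debug_dir=None):
    """Toy three-step pipeline that can dump every intermediate result."""
    steps = {}

    # Step 1: drop rows with missing values.
    cleaned = [r for r in raw_rows if r.get("value") is not None]
    steps["01_cleaned"] = cleaned

    # Step 2: divide every value by the largest one.
    top = max(r["value"] for r in cleaned)
    scaled = [{**r, "scaled": r["value"] / top} for r in cleaned]
    steps["02_scaled"] = scaled

    # Step 3: keep only rows at or above a (hypothetical) threshold.
    selected = [r for r in scaled if r["scaled"] >= 0.5]
    steps["03_selected"] = selected

    # Optional debug dump: one human-readable JSON file per step.
    if debug_dir is not None:
        out = Path(debug_dir)
        out.mkdir(parents=True, exist_ok=True)
        for name, data in steps.items():
            (out / f"{name}.json").write_text(json.dumps(data, indent=2))
    return selected


rows = [{"id": 1, "value": 4}, {"id": 2, "value": None}, {"id": 3, "value": 10}]
debug_dir = tempfile.mkdtemp()
result = run_pipeline(rows, debug_dir=debug_dir)
dumped = sorted(p.name for p in Path(debug_dir).iterdir())
```

Numbering the file names keeps the dumps in execution order, so a surprising final result can be traced back step by step.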
Maintenance programming for a large, unfamiliar codebase:
- Take a look at filesizes. The biggest files usually contain the meat of the program, or at least a dispatcher that points to the meat of the program. main.cc is usually tiny and useless for finding your way around.
- Single-step through the program with a debugger, starting at the main dispatch loop. You'll learn a lot about control flow.
- Look for data structures, particularly ones that are passed into many functions as parameters. Most programs have a small set of key data structures; find them and orienting yourself to the rest becomes much easier.
- Write unit tests. They're the best way to confirm that your understanding of the code is actually how the code works.
- Remove code and see what breaks. (Don't check it in though!)
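For the file-size tip in this list, a small script can rank the files in a local checkout. This is a rough sketch: the `largest_files` helper, the skipped directory names, and the demo files are assumptions made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def largest_files(root, limit=10, skip_dirs=(".git", "node_modules")):
    """Return (size_in_bytes, path) pairs for the biggest files under root."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that rarely hold hand-written source.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # e.g. a broken symlink or a permission problem
    return sorted(sizes, reverse=True)[:limit]


# Demo on a throwaway directory standing in for an unfamiliar repo.
demo = Path(tempfile.mkdtemp())
(demo / "main.cc").write_text("int main() { return run(); }\n")
(demo / "engine.cc").write_text("// lots of code\n" * 200)
(demo / ".git").mkdir()
(demo / ".git" / "packfile").write_text("x" * 100000)

top = largest_files(demo, limit=2)
for size, path in top:
    print(f"{size:>8}  {path}")
```

Pruning `.git` matters in practice, since pack files would otherwise dominate the ranking.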
Performance work:
- Don't, unless you've built it and it's too slow for users. Have performance targets for how much you need to improve, and stop when you hit them.
- Before all else (even profiling!), build a set of benchmarks representing typical real-world use. Don't let your performance regress unless you're very certain you're stuck at a local maximum and there's a better global solution just around the corner. (And if that's the case, tag your branch in the VCS so you can back out your changes if you're wrong.)
- Many performance bottlenecks are at the intersection between systems. Collect timing stats in any RPC framework, and have some way of propagating & visualizing the time spent for a request to make its way through each server, as well as which parts of the request happen in parallel and where the critical path is.
- Profile.
- Oftentimes you can get big initial wins by avoiding unnecessary work. Cache your biggest computations, and lazily evaluate things that are usually not needed.
- Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality. You can identify opportunities for this in the functions that are called a lot.
- When you've got a flat profile, there are often still very significant gains that can be obtained through changing your data structures. Pay attention to memory use; often shrinking memory requirements speeds up the system significantly through less cache pressure. Pay attention to locality, and put commonly-used data together. If your language allows it (shame on you, Java), eliminate pointer-chasing in favor of value containment.
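The "cache your biggest computations, and lazily evaluate things that are usually not needed" item above might look like this in Python. It is a minimal sketch: `expensive_score`, the `Request` class, and the `calls` counter are hypothetical names, and the counter exists only to show how often the real work runs.

```python
from functools import cached_property, lru_cache

# Counts how many times the real work is actually performed.
calls = {"score": 0, "report": 0}


@lru_cache(maxsize=None)
def expensive_score(n):
    """Stand-in for a costly pure computation; results are cached per argument."""
    calls["score"] += 1
    return sum(i * i for i in range(n))


class Request:
    def __init__(self, payload):
        self.payload = payload

    @cached_property
    def report(self):
        """Built only on first access, then stored on the instance."""
        calls["report"] += 1
        return sorted(self.payload)


first = expensive_score(10_000)
second = expensive_score(10_000)  # served from the cache, no recomputation
req = Request([3, 1, 2])          # report has not been computed yet
```

Caching like this is only safe when the function is pure, meaning the same arguments always give the same result; `cached_property` needs Python 3.8 or later.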
General code hygiene:
- Don't build speculatively. Make sure there's a customer for every feature you put in.
- Control your dependencies carefully. That library you pulled in for one utility function may have helped you save an hour implementing the utility function, but it adds many more places where things can break - deployment, versioning, security, logging, unexpected process deaths.
- When developing for yourself or a small team, let problems accumulate and fix them all at once (or throw out the codebase and start anew). When developing for a large team, never let problems accumulate; the codebase should always be in a state where a new developer could look at it and say "I know what this does and how to change it." This is a consequence of the reader:writer ratio - startup code is written a lot more than it is read and so readability matters little, but mature code is read much more than it is written. (Switching to the latter culture when you need to develop like the former to get users & funding & stay alive is left as an exercise for the reader.)
> Take a look at filesizes. The biggest files usually contain the meat of the program, or at least a dispatcher that points to the meat of the program. main.cc is usually tiny and useless for finding your way around.
This is my #1 pet peeve with GitHub. When I first look at an unfamiliar repo, I want to get a sense of what the code is about and what it looks like. The way I do that with a local project is by looking at the largest files first. But GitHub loves their clean uncluttered interface so much, they won't show me the file sizes!
I think this Chrome extension called "GitHub Repository Size" might be exactly what you are looking for
Yes, absolutely! I really want a whole-repo tree view of files along with their sizes and file types.
Check out Octotree if you're using Chrome. No file sizes still, but I've found that when you just want to quickly explore some potential new source this beats having to clone the repo first.
> Collect examples. Start with a spreadsheet of data that highlights the data you have available.
This is true not just for data science but when trying to solve any numerical problem. Using a spreadsheet (or an R / Python notebook) to implement the algorithm and getting some results has helped me in the past to really understand the problem and avoid dead ends.
For example, when building a FX pricing system, I was able to use a spreadsheet to describe how the pricing algorithm would work and explain it to the traders (the end users). We could tweak the calculations and make sure things were clear to all before implementing and deploying the algorithm.
Great advice!
Great advice. One nit to pick:
> Don't ignore constant factors. Sometimes an algorithm with asymptotically worse performance will perform better in practice because it has much better cache locality.
Forget the cache, sometimes they're just plain faster (edit in response to comment: I mean faster for your use case). I've e.g. found that convolutions can be much faster with the naive algorithm than with an FFT in a pretty decent set of cases. (Edit: To be specific, these cases necessarily only occur for "sufficiently small" vectors, but it turned out that was a larger size than I expected.) Caching doesn't necessarily explain it I think, it can just simply be extra computation that doesn't end up paying off.
Good correction, but a small second nit.
> sometimes they're just plain faster
Not faster for sufficiently large N (by definition).
But your general point is correct.
I've best seen this expressed in Rob Pike's 5 Rules of Programming [0], Rule 3:
Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy.
[0] http://users.ece.utexas.edu/~adnan/pike.html
> Not faster for sufficiently large N (by definition).
True, but supposedly researchers keep publishing algorithms with lower complexity that will be faster only if N is, like, 10^30 or so.
Or so Sedgewick keeps telling us.
Great comment! I have one point to add to '"Researchey" green-field development for data-science-like problems':
6. Use assertions for defining your expectations at each stage of the algorithm - they will make debugging much easier.
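As a rough illustration of the assertion idea, here is a tiny function that states its expectations before and after the computation. The `normalize` function and its specific checks are hypothetical, not taken from the thread.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    # Expectations about the input.
    assert weights, "expected at least one weight"
    assert all(w >= 0 for w in weights), "weights must be non-negative"

    total = sum(weights)
    assert total > 0, "weights must not all be zero"

    result = [w / total for w in weights]

    # Expectation about the output of this stage.
    assert abs(sum(result) - 1.0) < 1e-9, "normalized weights should sum to 1"
    return result


shares = normalize([2, 6, 2])
```

Note that Python strips `assert` statements when run with `-O`, so these checks suit exploratory work rather than validation of untrusted input.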
To expand on your #2:
I work with a lot of traders. One antipattern I noticed is that when there's a problem with the data, they'll do all sorts of permutations and aggregations and then scratch their chins and ponder about it for hours.
Go to the fucking source and find an example of the problem! Read it line by line, usually it will be obvious what happened.
Corollary: Don't assume your data is correct; most outliers in a large data set are problems with the data itself. Build a few columns that serve as sanity checks. One good example is a column that shows the distance between this sequence number and the last: anything >1 is a dropped message.
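The sequence-number check described here can be sketched in a few lines. The message layout (`seq`, `px`) and the `add_gap_column` helper are assumptions for illustration only.

```python
def add_gap_column(messages):
    """Return copies of the messages with a 'gap' sanity-check column."""
    checked = []
    previous = None
    for msg in messages:
        # Distance between this sequence number and the last one seen.
        gap = None if previous is None else msg["seq"] - previous
        checked.append({**msg, "gap": gap})
        previous = msg["seq"]
    return checked


feed = [
    {"seq": 101, "px": 1.1010},
    {"seq": 102, "px": 1.1012},
    {"seq": 105, "px": 1.1250},  # 103 and 104 never arrived
    {"seq": 106, "px": 1.1013},
]

checked = add_gap_column(feed)
# Any gap greater than 1 means at least one message was dropped.
dropped = [row for row in checked if row["gap"] is not None and row["gap"] > 1]
```

Here the only flagged row is the one with `seq` 105, which points straight at the place in the source data worth reading line by line.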
Great list. The only thing I'd add is to lean towards clever and often simple architecture when modelling a solution; it will often beat clever programming.
Addition to Performance/2.: Synchronization costs are typically the biggest deal in applications that involve I/O (e.g. hard drive or network). Try an average database transaction with and without synchronization. 1) On sqlite3, it's dozens vs hundreds of milliseconds. Bigger databases, probably not much difference. 2) Look up NFS sync issues. It's a huge speed/safety tradeoff. 3) On some file systems, a Debian installation may take 10 minutes or 90 minutes depending on whether you disabled sync (eatmydata command).
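One way to try the sqlite3 comparison yourself is the sketch below, which commits each insert separately under two settings of SQLite's `PRAGMA synchronous`. The table layout, row count, and `timed_inserts` helper are hypothetical, and the timings depend heavily on the disk and file system, so no particular numbers are claimed.

```python
import os
import sqlite3
import tempfile
import time


def timed_inserts(synchronous, rows=50):
    """Insert rows one transaction at a time; return (seconds, row_count)."""
    path = os.path.join(tempfile.mkdtemp(), "bench.db")
    conn = sqlite3.connect(path)
    # FULL waits for data to reach the disk on commit; OFF does not.
    conn.execute(f"PRAGMA synchronous = {synchronous}")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, body TEXT)")
    conn.commit()

    start = time.perf_counter()
    for i in range(rows):
        conn.execute("INSERT INTO t (body) VALUES (?)", (f"row {i}",))
        conn.commit()  # one transaction per row, so sync cost dominates
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


full_time, full_count = timed_inserts("FULL")
off_time, off_count = timed_inserts("OFF")
print(f"synchronous=FULL: {full_time:.4f}s for {full_count} commits")
print(f"synchronous=OFF:  {off_time:.4f}s for {off_count} commits")
```

Turning synchronization off trades durability for speed: a power loss can corrupt or lose recent commits, which is the speed/safety tradeoff the comment refers to.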
This is a great comment! If you have a blog could you make it a blog post? It deserves to be read more widely.
I've been tempted to start a blog...I've gotten a few requests on HN...but I'm still in the "rather be a programmer than a web celebrity" phase. I'm afraid it'll be too much of a distraction from my projects. Plus, I usually write better in response to a prompt than coming up with content in a vacuum.
You could perhaps consider turning such a 'response to a prompt' into a full-blown post on whatever easy-to-use blogging system (Medium, wordpress.com, hell even pastebin or its ilk?), and linking to it alongside a comment.
On HN, I often check specific commenters' activity, or new comments in a thread that can be days-old, because I often find hidden gems. A link to more elaboration would definitely count as such. Perhaps an audience of even just a dozen or so people like me might be worth it.
EDIT: Coincidentally (I swear), there are exactly a dozen comments that positively engage with your comment!
Could you put a link back to the original source in the doc? That way you capture the comments & discussion, which has a bunch of useful clarifications.
Point #5 about unit tests is so true, I wish I had known it before jumping on the bandwagon. I wasted so much time writing TDD code only to learn later that the specs had changed. This can save you an insane amount of pain and time.
I hate printers. I haven't used one in months. However, I just printed this comment and I'm hanging it up in my office. Perfect.
You're a genius! How long have you been into the developing world, if I may ask?
12 years, plus college, a gap year, and a couple internships & projects in high school.
It's helped that it's really been 12 (well, 13+change) years of experience, rather than one year repeated 12 times. Each year has brought something new and different that's just out of my comfort zone.
Original: https://news.ycombinator.com/item?id=14709076