Becoming a data scientist


Topics: Life Advice, Career Advice, Computer Science, Machine Learning, Statistics, Data, Data Science, Data Analysis, Data Mining, Big Data

How do I become a data scientist?

Background: I recently finished my bachelor's degree in computer science at Berkeley. Although it may be a bit late, I am just now getting interested in learning more about statistics and "data science." Unfortunately, I don't have much of a math background (I only took up to linear algebra, plus the probability/discrete math course required for CS). Although I have started working, I have the option of enrolling in an MS CS program in January. What courses should I be looking at, and would an MS in Statistics be more useful? If so, is it possible to get into an MS in Statistics without a strong math background? I will probably be looking into taking machine learning and data visualization.

 

9 Answers

   
Alex Kamil
82 votes by Edwin Khoo, Anon User, Neil Kodner, (more)
Strictly speaking, there is no such thing as "data science" (see What is data science?). See also Vardi, Science has only two legs: http://portal.acm.org/ft_gateway...
 
Here are some resources I've collected about working with data; I hope you find them useful (note: I'm an undergrad student, so this is not an expert opinion in any way).

1) Learn about matrix factorizations:

Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With terabytes of data, traditional tools such as Matlab are no longer suitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numerical algorithms/LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust them for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines [6]. Numerics courses are usually built on undergraduate algebra and calculus, so you should be fine with the prerequisites. I'd recommend these resources for self-study/reference material (a small factorization sketch in Python follows the reading list):

  • BellKor, Matrix factorization for recommender systems: www2.research.att.com/~volinsky/...
  • BellKor, Scalable Collaborative Filtering..: public.research.att.com/~volinsk...
  • Press et al., Numerical Recipes in C++:  http://www.amazon.com/Numerical-...
  • Golub & Van Loan: Matrix Computations: http://www.amazon.com/Computatio...
  • Watkins, Fundamentals of Matrix Computations (this is a very gentle intro to the field): http://www.amazon.com/Fundamenta...
  • Demmel, Applied Numerical Linear Algebra: http://www.amazon.com/Applied-Nu...
  • Trefethen & Bau, Numerical linear algebra: http://www.amazon.com/Numerical-...
  • Watkins: The Matrix Eigenvalue Problem: GR and Krylov Subspace Methods: http://www.amazon.com/Matrix-Eig...
  • Parlett, The Symmetric Eigenvalue Problem: http://www.amazon.com/Symmetric-...
  • Iverson, Algebra as a language: http://www.jsoftware.com/papers/...
  • Iverson, Algebra: an algorithmic treatment: http://www.amazon.com/Algebra-al...
  • Bertsekas, Parallel and Distributed Computation: Numerical Methods: http://www.amazon.com/Parallel-D...
  • Hamming, Numerical Methods for Scientists and Engineers: http://www.amazon.com/Numerical-...
  • Bierman, Factorization Methods for Discrete Sequential Estimation: http://www.amazon.com/Factorizat...
  • Wilkinson, The algebraic Eigenvalue Problem: http://www.amazon.com/Algebraic-...
  • Horn, Matrix Analysis: http://www.amazon.com/Matrix-Ana...
  • Harville, Matrix Algebra from a Statistician's Perspective: http://www.amazon.com/gp/product...
  • Fiedler, Special Matrices: http://www.amazon.com/Special-Ma...
  • Higham, Accuracy and stability of numerical algorithms: http://www.amazon.com/gp/product...
  • Langville & Meyer, Google Page Rank and Beyond: http://www.amazon.com/Googles-Pa...
  • Nielsen, PageRank tutorial: http://michaelnielsen.org/blog/u...
  • Mannix, Numerical recipes in Hadoop: http://www.slideshare.net/jakema...
  • Godsil, Algebraic Graph Theory: http://www.amazon.com/Algebraic-...
  • Wheeler: On building a stupidly fast graph database: http://blog.directededge.com/200...
  • http://numpy.scipy.org/
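
As a concrete starting point, here is a minimal sketch of the low-rank factorization idea behind the BellKor recommender papers above: factor a small ratings matrix with a truncated SVD in Python/NumPy. The ratings are made up for illustration, and a real recommender would use an iterative method (ALS/SGD) that handles missing entries rather than a plain SVD.

```python
import numpy as np

# Toy ratings matrix: rows = users, columns = items (made-up numbers).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Truncated SVD: keep only the top-k singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("rank-%d reconstruction:" % k)
print(np.round(R_k, 2))

# The user/item factor matrices are what a recommender actually stores.
user_factors = U[:, :k] * np.sqrt(s[:k])
item_factors = Vt[:k, :].T * np.sqrt(s[:k])
print("predicted rating of user 0 for item 2: %.2f"
      % float(user_factors[0] @ item_factors[2]))
```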

2) Start learning statistics by coding with R: 
  • Pick up some R manuals (see
What are essential references for R?) and experiment with some of these data sets:  http://www.datawrangling.com/som...
and UCI Machine learning repository:  http://archive.ics.uci.edu/ml/
  • Here is a good reference to get started with regression analysis (a minimal least-squares sketch in Python follows this list): 
Gelman, Data Analysis Using Regression and Multilevel/Hierarchical Models:
http://www.amazon.com/Analysis-R...
  •  Albert, Bayesian computation with R:
http://www.amazon.com/Bayesian-C...
  • Spector, Data Manipulation with R:
http://www.amazon.com/Bayesian-C...
  • Gries, Quantitative corpus linguistics with R: http://www.amazon.com/Quantitati...
  • Duda & Hart, Pattern Classification: http://www.amazon.com/Pattern-Cl... (a classic book on statistical inference and a very readable intro to the field)
  • Go through Exploratory Data Analysis by Tukey: http://www.amazon.com/Explorator.... Read Hamming for inspiration: http://www.cs.virginia.edu/~robi...
  • If you want to get a job, look up "statistician" or "data scientist" job specs on Twitter and see what the market wants: http://twitter.com/#search?q=sta... , http://twitter.com/#search?q=%22...
  • E.g., here is Netflix's definition of the "data scientist" body of knowledge (http://jobs.netflix.com/DetailFl...): Multivariate Regression, Logistic Regression, Support Vector Machines, Bagging, Boosting, Decision Trees, Time Series Analysis, Optimization, Stochastic Processes, Experiment Analysis, Bootstrapping, R, SAS, Python, Weka, SQL, and Excel. This looks like a standard Statistics curriculum.
  • According to a LinkedIn job posting (http://www.sanfranrecruiter.com/...) you need to know some of the following: algorithm design, information retrieval, relational databases (SQL) and non-relational databases (Hadoop/Pig), big data analytics, data classification, text mining, and search algorithms. This seems to be more of a CS/IR-oriented role.
  • Learn about Palantir (http://www.palantirtech.com/), Recorded Future (https://www.recordedfuture.com/) and Lyric Semiconductor (http://www.lyricsemiconductor.com/), they make interesting products.
  • Subscribe to DBWorld (it's a bit noisy but worth following): http://www.cs.wisc.edu/dbworld/ ; consider joining at least one of these interest groups: http://www.sigkdd.org/ , http://www.sigir.org/ , http://www.sigmod.org/ , http://www.sigsam.org , http://www.amstat.org/ , http://www.siam.org/
  • Choose an interesting problem to tackle, say temporal search: http://www.google.com/search?q=t...
  • See what interests you more and do your market research. Would you prefer working with vendor tools and doing mostly modeling and reporting, or building data mining systems yourself and writing a lot of code? Do you see yourself as a corporate employee, a researcher in academia, or a startup founder in the future? What data interests you? Structure your curriculum based on that.
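
The list above recommends R for this; as a language-neutral illustration of the same regression idea, here is a minimal ordinary least-squares fit in Python/NumPy. The dataset (house prices predicted from size and age) is made up for illustration.

```python
import numpy as np

# Made-up data: predict house price (in $1000s) from size and age.
size_sqft = np.array([850, 900, 1200, 1500, 1800, 2100], dtype=float)
age_years = np.array([30, 25, 15, 10, 8, 3], dtype=float)
price_k = np.array([180, 195, 260, 310, 360, 420], dtype=float)

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(size_sqft), size_sqft, age_years])

# Ordinary least squares: minimize ||X b - y||_2.
coef, _, _, _ = np.linalg.lstsq(X, price_k, rcond=None)
print("intercept, size, age coefficients:", np.round(coef, 3))

# In-sample R^2 as a quick sanity check.
pred = X @ coef
ss_res = np.sum((price_k - pred) ** 2)
ss_tot = np.sum((price_k - price_k.mean()) ** 2)
print("R^2 = %.3f" % (1.0 - ss_res / ss_tot))
```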

3)  Learn about distributed systems and databases:
  • Note: this topic is not part of a standard Machine Learning track, but you can probably find courses such as Distributed Systems or Parallel Programming in your CS/EE catalog. I believe it is important to learn how to work with a Linux cluster and how to design scalable distributed algorithms if you want to work with big data. It is also becoming increasingly important to be able to utilize the full power of multicore machines (see http://en.wikipedia.org/wiki/Moo... , http://techresearch.intel.com/ar...).
  • Download Hadoop [8] and run some MapReduce jobs on your laptop in pseudo-distributed mode (see 
What's the best way to come up to speed on MapReduce, Hadoop, and Hive? )
  • Learn about Google technology stack (MapReduce, BigTable, Dremel, Pregel, GFS, Chubby, Protobuf etc). (See 
What are the most interesting Google Research papers? 
also http://research.google.com/pubs/... and  http://www.umiacs.umd.edu/~jimmy...,   http://www.columbia.edu/~ak2834/...)
  • Setup account with Amazon AWS/EC2/S3/EBS and experiment with running Hadoop on a cluster with large data sets (you can use Cloudera or YDN images, but in my opinion you can better understand the system if you set it up from scratch, using the original distribution). Watch the costs.
  • Try out Hadoop alternatives, specifically the minimalist frameworks such as BashReduce: http://github.com/erikfrey/bashr... and CloudMapReduce: http://code.google.com/p/cloudma...  (see 
What are some promising open-source alternatives to Hadoop MapReduce for map/reduce? )
  • Run Brian Cooper's Cloud Serving Benchmark (YCSB) on AWS; compare HBase vs Cassandra performance on a small cluster (6-8 nodes): http://wiki.github.com/brianfran...
  • Run LINPACK benchmark: http://www.datawrangling.com/on-...
  • Run some experiments with MPI (http://www.mcs.anl.gov/research/...): try to implement a simple clustering algorithm (e.g. http://en.wikipedia.org/wiki/K-m...) with MPI vs Hadoop/MapReduce and compare the performance, fault tolerance, ease of use, etc. Learn the differences between the two approaches and when it makes sense to use each one.
  • Check out Dongarra's papers: http://www.netlib.org/utk/people...
  • There is a new library called MapReduce-MPI (http://www.sandia.gov/~sjplimp/m...); see how it works and how it compares to other MapReduce implementations
  • Run some tests with Scalapack [5], try to port one of the routines to Hadoop, compare the performance and scalability
  • Write your own simplified MapReduce runtime in C or any other programming language (a minimal Python sketch follows this list)
  • Check out http://www.cascading.org/, http://clojure.org/ and http://github.com/bradford/infer
  • Learn about distributed hash tables (http://en.wikipedia.org/wiki/Dis...)
  • Learn about Paxos (http://en.wikipedia.org/wiki/Pax...), run some experiments with open source implementations.
  • Download Nutch (http://nutch.apache.org/) or Solr (http://lucene.apache.org/solr/), run a crawl on Wikipedia. Analyze the collected data with R (see item 2 above) or Python (http://www.nltk.org/)
  • Write your own simplified crawler/indexer, test the performance and scalability, look at the Lucene source for ideas, and look at http://infolab.stanford.edu/~bac... for inspiration. You can probably build it as a term project in either an Information Retrieval or a Search Engines course.
  • Learn about prefix sums: http://en.wikipedia.org/wiki/Pre... , parallel matrix multiplication: http://www.cs.berkeley.edu/~yeli... , streaming: http://infolab.stanford.edu/stream/ , and BSP: http://en.wikipedia.org/wiki/Bul...
  • Pick one of the PGAS languages (http://en.wikipedia.org/wiki/Par...), e.g. X10 (http://en.wikipedia.org/wiki/X10...); go through the tutorials (http://ppppcourse.ning.com/forum...), run some HPC benchmarks (LU, FFT) and the examples (the streaming example in particular); see how it scales on a cluster/AWS, compare to sequential and Hadoop/MapReduce implementations, and see what kind of performance/scalability gains it gives you on multicore boxes. 
  • Some good references on parallel programming: Herlihy & Shavit, The Art of Multiprocessor Programming: http://www.amazon.com/Art-Multip... ; Blelloch, Vector Models for Data-Parallel Computing: http://citeseerx.ist.psu.edu/vie... ; Valiant, A Bridging Model for Parallel Computation: http://portal.acm.org/citation.c... ; Hillis & Steele, Data Parallel Algorithms: http://portal.acm.org/citation.c... 
  • Take a course in Parallel Computer Architecture: http://www.eecs.berkeley.edu/~cu...
  • Check out Cilk: http://software.intel.com/en-us/...
  • Run some experiments with Weka (http://www.cs.waikato.ac.nz/ml/w...) or RapidMiner (http://rapid-i.com/), pick a simple algorithm and port it to MapReduce, see how it scales on a cluster/AWS
  • Experiment with distributed 'NoSQL' data stores (Voldemort, HBase, Redis, Tokyo Cabinet, Cassandra, etc.). Figure out what the CAP theorem is all about (http://www.allthingsdistributed....). Create a simple app with a key-value or column-based store as a back-end. Import several GBs of interesting data into it and run some simple clustering/KNN algos (http://en.wikipedia.org/wiki/Clu... , http://en.wikipedia.org/wiki/Nea...). Optimize your algo to better utilize random access patterns and experiment with various tuning options. Build a front-end visualization for the results (check out Protovis or a similar visualization package: http://vis.stanford.edu/protovis/)
  • A good resource on 'NoSQL': Varley, No Relation: The Mixed Blessings of Non-Relational Databases: http://ianvarley.com/UT/MR/Varle...
  • Learn about main-memory databases: http://en.wikipedia.org/wiki/In-... , http://scholar.google.com/schola... , http://monetdb.cwi.nl/
  • Write a distributed hash table in C, here is a good reference: http://pdos.csail.mit.edu/papers...
  • Write a distributed file system in C. Learn how to write good systems code using the following resources:
http://swtch.com/~rsc/
http://herpolhode.com/rob/
http://www.cs.princeton.edu/~bwk/
http://cm.bell-labs.com/who/dmr/
http://www.cs.columbia.edu/~aho/
http://plan9.bell-labs.com/who/ken/
http://www.informatik.uni-trier....
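
Following the "write your own simplified MapReduce runtime" item above, here is a minimal single-machine sketch in Python. It only reproduces the map-shuffle-reduce control flow on a word-count job; a real runtime would add partitioning across machines, spilling to disk, and fault tolerance.

```python
from collections import defaultdict
from multiprocessing import Pool

def map_wordcount(line):
    # Map phase: emit (word, 1) pairs for one input record.
    return [(word.lower(), 1) for word in line.split()]

def reduce_wordcount(word, counts):
    # Reduce phase: combine all values that share a key.
    return word, sum(counts)

def map_reduce(records, mapper, reducer, workers=4):
    # Run mappers in parallel, group intermediate pairs by key, reduce per key.
    with Pool(workers) as pool:
        mapped = pool.map(mapper, records)
    shuffled = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            shuffled[key].append(value)
    return [reducer(key, values) for key, values in shuffled.items()]

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    for word, count in sorted(map_reduce(lines, map_wordcount, reduce_wordcount)):
        print(word, count)
```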

4)  Learn about data compression
 To be added
5)  Learn about machine learning

  • This is an excellent resource for self-study: Cross, Learning about machine learning: http://measuringmeasures.com/blo... , also http://metaoptimize.com/qa/quest...
  • The alternative (and rather expensive) option is to enroll in a CS program/Machine Learning track if you prefer studying in a formal setting.
  • Since all the standard machine learning, data mining, IR, statistics, AI, and NLP content is available online, can be forked on GitHub, or can be purchased on Amazon, I personally don't see much value in studying for a Master's degree unless you want a corporate job afterwards.
  • See: Was your Master's in Computer Science (MS CS) degree worth it and why? , When is it a good idea to get an MS in Computer Science? , Was your Master's degree in Statistics/Applied Math/Symbolic systems worth it and why? What are the advantages and disadvantages of doing a CS PhD?
  • [Higher Education] Which are the best universities for an MS or PhD related to Information Retrieval, and why?  
  • See Lorica, How to nurture data scientists: http://practicalquant.blogspot.c...
  • You can structure your study program according to online course catalogs and curricula of MIT (http://web.mit.edu/catalog/degre... , http://ocw.mit.edu/courses/elect...), Stanford (http://www.stanford.edu/dept/reg...) or other top engineering schools. Experiment with data a lot, hack some code, ask questions, talk to good people, set up a web crawler in your garage (http://www.ngoprekweb.com/2006/1...).
  • Joining a well-capitalized data-driven startup and learning by doing (with some part-time self-study using the resources above) could be a good option. See 
What are the hottest startups in the analytics space?  
Who are the best VCs in the field of analytics / data mining / databases?
Which companies have the best data science teams? 
What are the notable startups in the news space?
Does the US Census have a data team?  
Why do so many data geeks join web companies instead of solving large scale data problems in biology?  

6)  Learn about least-squares estimation and Kalman filters:

  • This is a classic topic and "data science" par excellence in my opinion. It is also a good introduction to optimization and control theory. Start with Bierman's LLS tutorial given to his colleagues at JPL; it is clearly written and inspiring (the Apollo mission trajectory was estimated using these methods): http://www.amazon.com/Factorizat... , also see Curkendall & Leondes: http://adsabs.harvard.edu/full/1974CeMec...8..481C and Quarles: http://citeseerx.ist.psu.edu/vie.... (A one-dimensional Kalman filter sketch follows this list.)
  • See Steven Kay's series on statistical signal estimation: http://www.amazon.com/Fundamenta..., also check out his short course outline at University of Rhode Island for a list of interesting topics to learn (this is usually part of EE curricula): http://www.ele.uri.edu/faculty/k...
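
To make the least-squares/Kalman connection concrete, here is a minimal one-dimensional Kalman filter in Python/NumPy that tracks a constant value from noisy measurements. The noise variances and the random-walk state model are assumptions chosen for illustration, not taken from the references above.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0
n_steps = 50
measurement_var = 2.0    # R: sensor noise variance (assumed)
process_var = 1e-4       # Q: allowed drift of the true state (assumed)

measurements = true_value + rng.normal(0.0, np.sqrt(measurement_var), n_steps)

x_hat, P = 0.0, 1.0      # initial state estimate and its variance

for z in measurements:
    # Predict: random-walk model, so the mean stays put and the variance grows.
    P += process_var
    # Update: blend prediction and measurement using the Kalman gain.
    K = P / (P + measurement_var)
    x_hat += K * (z - x_hat)
    P *= (1.0 - K)

print("final estimate: %.3f (true value %.1f)" % (x_hat, true_value))
```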

7) Check out these Q&A:

What are the best blogs about data?  
What are the best Twitter accounts about data? 
What are the best blogs about bioinformatics? 
What are the best Twitter accounts about bioinformatics? 
What is data science?
What are the best courses at MIT?  
What are the best resources to learn about web crawling and scraping? 
What are the best interview questions to evaluate a machine learning researcher? 
What are the best resources for learning about distributed file systems?
What are some useful packages for working with large datasets in R? 
What are some good books on stringology and pattern matching?  
What's a good introductory machine learning text? 
What is the best book to pick up working knowledge of theoretical statistics (assuming strong general math)? 
Can anyone recommend a fantastic book on time series analysis? 
What are the standard texts on linear regression? 
What are some good books on random processes? 
How has BigTable evolved since the 2006 Google paper? 
What is a good source for learning about Bayesian networks? 
What are the best data visualizations ever created? 
What are some of the prediction and risk estimation models used by insurance companies? 
How do scientists share data? 
What are the best quant hedge funds? 
What are the best books on econometrics? 
What are the best introductory books on mathematical finance? 
What is the best approach for text categorization?
What are the numbers that every engineer should know, according to Jeff Dean?


If you do decide to go for a Masters degree:

8) Study Engineering - I'd go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a "data scientist" you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis, etc., but not how to build systems; I think the latter is more urgently needed these days as the old tools become obsolete under the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experiments with R (see item 2 above), or take some statistics classes as part of your CS studies. 

Good luck.

[1]  http://mahout.apache.org/
[2]  http://www.netlib.org/lapack/
[3]  http://www.netlib.org/eispack/
[4]  http://math.nist.gov/javanumeric...
[5]  http://www.netlib.org/scalapack/
[6]  http://labs.google.com/papers/ma...
[7]  http://www.r-project.org/
[8]  http://hadoop.apache.org/ 
7 Comments Wed Aug 25 18:08:10 UTC+0800 2010
Alex Kamil
   
Peter Skomoroch, Sr. Data Scientist @ LinkedIn - ... 19 endorsements
12 votes by Alex Kamil, Lakshmi Narasimhan Parthasarathy, Mat Kelcey, (more)
If you have the time to take courses, give it a shot.

1) Try to take some of the undergrad math courses you missed. Linear Algebra, Advanced Calculus, Diff. Eq., Probability, Statistics are the most important.  After that, take some Machine Learning courses.  Read a few of the leading ML textbooks and keep up with journals to get a good sense of the field.

2) Read up on what the top data companies are doing. After 1 or 2 machine learning courses you should have enough background to follow most of the academic papers. Implement some of these algorithms on real data (see the k-means sketch below for a minimal starting point).

3) If you are working with large datasets, get familiar with the latest techniques & tools (Hadoop, NoSQL, R, etc.) by putting them into practice at work (or outside of work).
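
As one way to act on item 2, here is a minimal k-means clustering sketch in Python/NumPy on synthetic 2-D points; the data is made up, so swap in a real dataset (e.g. from the UCI repository) once the mechanics are clear.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic 2-D blobs (made-up data).
    data = np.vstack([
        rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
        rng.normal([5.0, 5.0], 0.5, size=(100, 2)),
    ])
    centers, assignments = kmeans(data, k=2)
    print("centroids:\n", np.round(centers, 2))
```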

Read these posts by Mike Driscoll:

* http://dataspora.com/blog/the-se...
* http://dataspora.com/blog/sexy-d...
Fri Sep 3 05:17:29 UTC+0800 2010
Peter Skomoroch
   
Joseph Misiti
6 votes by Charlie Cheever, Edwin Khoo, Mei Marker, (more)
I am currently working as a data engineer with a team of others and I can tell you what we all have in common:

1) An MS or PhD in Applied Mathematics or Electrical Engineering
2) Fluency in C++/Matlab/Python
3) Experience building distributed systems and algorithms.

I agree with Anon that CS is probably not the way to go unless you are going to MIT, Caltech, Stanford, CMU, etc. The way I ended up in the field was working as a software engineer designing real-time systems and getting an MS in Applied Math part-time. After 4 years I had skills from both fields and was offered a position doing ML/DM. With that said, I can tell you that it's an extremely interesting field, and it appears the skill set will only become more desirable in the future.
2 Comments Thu Aug 26 10:24:51 UTC+0800 2010
Joseph Misiti
   
Gregory Piatetsky, analytics/data mining consultant... 1 endorsement
5 votes by Peter Skomoroch, Susheel Kiran J, Carlos Leiva Burotto, (more)
A good start for becoming a data scientist is to get an MS (or PhD) in Machine Learning / Data Mining - along the way you will get plenty of experience in relevant math and use the latest systems. Stanford, UCI, CMU, and MIT are top schools, but there are many others in the USA - see
http://www.kdnuggets.com/educati... and in Europe
http://www.kdnuggets.com/educati...

Stanford has online courses in data mining / ML - check
http://www.kdnuggets.com/2010/06...
http://scpd.stanford.edu/
Thu Sep 9 02:08:55 UTC+0800 2010
Gregory Piatetsky
   
Russell Jurney, Data Viznik, Hack Historian 2 endorsements
4 votes by Alex Kamil, Simplicio Gamboa III, Luis Alberto Santana and Mat Kelcey
The school route is well covered.  This is the autodidactic route:

Look at some common problems solved with machine learning.  Look at problems in your areas of interest with an abundance of available data. Intersect these sets, pick a problem to solve with ML. Learn whatever it takes to solve it poorly.  Get people using the output of your model. Iterate, learn more techniques.  Work on your maths as needed.  Find mentors to talk with about problems you're working on.  Keep them updated, collaborate, learn from them.

Get good at building things with data.  Update your LinkedIn profile - congratulations, you're a data scientist!
Thu Sep 2 07:26:18 UTC+0800 2010
   
Paco Nathan, 45 years ago I couldn't even spe... 4 endorsements
4 votes by Joey Shurtleff, Edwin Khoo, Alex Kamil and Josh Wills
Stanford has an interdisciplinary degree specifically for data science, called Mathematical and Computational Sciences (MCS). It's sponsored by the Stats department and overlaps with CS, Math, Operations Research, etc.  http://www.stanford.edu/group/ma...  The BS degree dovetails particularly well with a co-term program to get an MS in Computer Science -- say, with a distributed systems specialization.

+1 to both Pete's and Russ' wise words above.
1 Comment Wed Sep 8 11:57:34 UTC+0800 2010
Paco Nathan
   
Yaniv Goldenrand, Fraud and credit modeling
3 votes by Alex Kamil, Kevin Li and Seb Paquet
Get a job doing it; this way you'll learn what really matters and get paid in the process.
The standard way to become a data analyst is master's in math/statistics + internship.
 
Other ways are:
- PhD in some empirical subject (economics, psychology).
- Get an engineering position in some data-intensive company and convert.
Some of the best modelers I know are ex-programmers.
Thu Aug 26 07:33:52 UTC+0800 2010
Yaniv Goldenrand
   
Sandro Saitta
1 vote by Alex Kamil
Reading data-mining-related blogs is also important to understand the wide range of application areas of data mining. You can find a list of data mining blogs here: http://www.dataminingblog.com/li...
1 Comment Fri Sep 24 18:29:28 UTC+0800 2010
Sandro Saitta
   
Xuehua Shen
1) infrastructure of data processing, such as Hadoop/MapReduce,  Pig/Hive, and automation/cron.
2) simple stats about data, such as mean, correlation, and p-value (see the sketch below).
3) algorithms for data modeling,  such as logistic regression, and SVM.
4) visualization of data, such as chart and table.
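
As a tiny illustration of item 2 above, here is a Python sketch computing a mean, a Pearson correlation, and its p-value with NumPy/SciPy; the numbers are made up.

```python
import numpy as np
from scipy import stats

# Made-up data: daily page views and sign-ups for a small site.
page_views = np.array([120, 150, 90, 200, 170, 130, 160], dtype=float)
signups = np.array([8, 11, 5, 15, 12, 9, 10], dtype=float)

print("mean page views: %.1f" % page_views.mean())

# Pearson correlation, with the p-value for the null hypothesis of no correlation.
r, p_value = stats.pearsonr(page_views, signups)
print("correlation r = %.3f, p-value = %.4f" % (r, p_value))
```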
Mon Sep 13 01:57:29 UTC+0800 2010

 

Reposted from: https://www.cnblogs.com/sxfmol/archive/2010/09/27/1836806.html
