Bayes' rule is critical to machine learning because it allows us to compute any of the many, many joint probabilities even when there is no data for a particular combination of features. But with big data, since we may have enough data for every combination of features, we can directly find out which class an instance belongs to by computing the expectation of Y from the existing data for that particular combination of X.
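
A minimal sketch (on hypothetical toy data) of that "big data" shortcut: estimate P(Y | X = x) directly from the counts for the exact feature combination, falling back only when a combination was never seen (which is where Bayes' rule earns its keep):

```python
from collections import Counter, defaultdict

# Toy dataset: each row is (feature combination x, class label y).
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
    (("rainy", "mild"), "yes"),
]

# Count class labels separately for each distinct combination of X.
counts = defaultdict(Counter)
for x, y in data:
    counts[x][y] += 1

def predict(x):
    """Most frequent class among the rows matching this exact combination."""
    c = counts[x]
    return c.most_common(1)[0][0] if c else None  # None: combination never seen

print(predict(("sunny", "hot")))   # 'no'  -- enough data, no Bayes' rule needed
print(predict(("rainy", "hot")))   # None  -- here Bayes' rule would step in
```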


-------------------------


With big data, when trying to find rules from features, finding regions of high support is a little more difficult and is often referred to as bump hunting.


----------------------------------

Traditional means of clustering data are k-means clustering, agglomerative (hierarchical) clustering, and even LSH (locality-sensitive hashing).
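
A minimal k-means sketch in plain NumPy on toy points (real use would rely on a library implementation such as scikit-learn's):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, cents = kmeans(pts, k=2)
print(labels)  # two clusters: the points near (0,0) and the points near (5,5)
```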


-----------------------------------


The most popular technique for finding rules (clustering features) in data is called association rule mining (the Apriori algorithm).
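
A minimal sketch of Apriori's level-wise frequent-itemset search, on hypothetical toy baskets (rule generation from the frequent itemsets is omitted):

```python
# Toy transactions (market baskets).
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
]
min_support = 0.5  # fraction of baskets an itemset must appear in

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search: only extend itemsets that are already frequent, since
# any superset of an infrequent itemset must also be infrequent.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)
while frequent:
    candidates = {a | b for a in frequent for b in frequent
                  if len(a | b) == len(a) + 1}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent

for s in all_frequent:
    print(set(s), support(s))   # e.g. {'milk', 'diapers'} 0.75
```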


----------------



Drawbacks of association rule mining

one -- It only works for rules with high support; classes with low support are left out. (This is where decision trees are used, which are based on the mutual information between the class and each feature or set of features. But computing mutual information is costly, because it requires all the joint probabilities to be known, which is not really possible for large numbers of features -- see the sketch after this list.)


two -- Negative (implicit) rules are lost (milk and not diapers => not beer). (This is where people use techniques called "interesting subgroup discovery", looking for correlations in the data instead.)
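
The sketch referred to above: computing I(X;Y) for a single feature against the class, on hypothetical counts. Note that even this simple case needs every joint probability p(x, y); over *sets* of features the joint table grows exponentially, which is exactly the cost being described:

```python
import math
from collections import Counter

# Joint observations of one feature X against the class Y (toy data).
pairs = [("milk", "buys"), ("milk", "buys"), ("milk", "skips"),
         ("nomilk", "skips"), ("nomilk", "skips"), ("nomilk", "buys")]

n = len(pairs)
pxy = {k: v / n for k, v in Counter(pairs).items()}   # joint p(x, y)
px = Counter(x for x, _ in pairs)                     # marginal counts of X
py = Counter(y for _, y in pairs)                     # marginal counts of Y

# I(X;Y) = sum over all (x, y) of p(x,y) * log( p(x,y) / (p(x) p(y)) )
mi = sum(p * math.log2(p / ((px[x] / n) * (py[y] / n)))
         for (x, y), p in pxy.items())
print(f"I(X;Y) = {mi:.3f} bits")
```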

-------------

/***********

* for long data, i.e. high-dimensional data

************/

Matrix techniques (NNMF, non-negative matrix factorization) are not the only way of finding latent models -- that is, extracting classes and features from a bipartite graph of elements (where classes and features are in some sense interchangeable). There are other, probabilistic ways such as LDA (latent Dirichlet allocation), and there are other matrix techniques like SVD (singular value decomposition).

-- usage: determining topics from a bunch of documents (topic finding), finding roles of people (e.g. kinds of book buyers), or building recommendation systems (such as the book and movie recommendations at Amazon and Netflix).
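
A minimal sketch of the SVD route on a hypothetical term-document count matrix; the left singular vectors play the role of latent "topics" (NNMF and LDA would be used analogously):

```python
import numpy as np

# Rows = terms, columns = documents; entries are toy term counts.
terms = ["ball", "goal", "vote", "party"]
A = np.array([[3.0, 2.0, 0.0, 0.0],   # 'ball' mostly in docs 0-1 (sports)
              [2.0, 3.0, 0.0, 0.0],   # 'goal'
              [0.0, 0.0, 4.0, 2.0],   # 'vote' mostly in docs 2-3 (politics)
              [0.0, 0.0, 2.0, 4.0]])  # 'party'

# Truncated SVD: A ~= U_k S_k V_k^T; columns of U are the latent "topics".
U, S, Vt = np.linalg.svd(A, full_matrices=False)
for t in range(2):
    top = np.argsort(-np.abs(U[:, t]))[:2]     # heaviest terms in topic t
    print(f"topic {t}:", [terms[i] for i in top])
# -> one topic dominated by 'vote'/'party', the other by 'ball'/'goal'
```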


--------------------------------

The frontier of research today in web intelligence: enabling machines to learn both the classes and the features in an unsupervised manner. Bottom-up learning, or 'grounded' techniques, are essentially talking about things like this.


----------------------------

learning facts from text collections -- supervised (Bayesian networks, HMMs (Hidden Markov Models)), where positional order is important. There are other techniques like CRFs.
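
A minimal sketch of HMM decoding with the Viterbi algorithm, tagging words with hypothetical states and probabilities (a toy stand-in for extracting facts from text):

```python
# Toy HMM tagger: label each word NAME or OTHER (hypothetical probabilities).
states = ["NAME", "OTHER"]
start = {"NAME": 0.3, "OTHER": 0.7}
trans = {"NAME": {"NAME": 0.4, "OTHER": 0.6},
         "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
emit = {"NAME": {"obama": 0.5, "said": 0.01, "the": 0.01},
        "OTHER": {"obama": 0.05, "said": 0.4, "the": 0.5}}

def viterbi(words):
    # v[s] = probability of the best state path ending in s after each word.
    v = {s: start[s] * emit[s].get(words[0], 1e-6) for s in states}
    back = [{}]                       # back[i][s] = best predecessor of s
    for w in words[1:]:
        nv, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] * trans[p][s])
            bp[s] = prev
            nv[s] = v[prev] * trans[prev][s] * emit[s].get(w, 1e-6)
        v, back = nv, back + [bp]
    # Walk the back-pointers from the best final state.
    s = max(states, key=lambda q: v[q])
    path = [s]
    for bp in reversed(back[1:]):
        s = bp[s]
        path.append(s)
    return path[::-1]

print(viterbi(["obama", "said", "the"]))  # ['NAME', 'OTHER', 'OTHER']
```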


-----------------------------------


The combination of unification and logical inference (entailment) is called resolution.

Logical inference vs. predicate logic resolution
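
A minimal sketch of unification on flat terms (hypothetical predicates; variables are marked with '?'), matching the "X bound to Obama" example used later in these notes:

```python
def unify(a, b, bindings=None):
    """Minimal unification of two flat terms; variables start with '?'."""
    if len(a) != len(b):
        return None
    bindings = dict(bindings or {})
    for x, y in zip(a, b):
        x, y = bindings.get(x, x), bindings.get(y, y)   # follow bindings
        if x == y:
            continue
        if x.startswith("?"):
            bindings[x] = y
        elif y.startswith("?"):
            bindings[y] = x
        else:
            return None        # two different constants: unification fails
    return bindings

# president(?X) against president(Obama): ?X gets bound to Obama.
print(unify(("president", "?X"), ("president", "Obama")))  # {'?X': 'Obama'}
```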


---------------------------------------------


Inference in Bayesian networks: the junction tree algorithm is well known. One can also use SQL to understand what is happening in a Bayesian network.
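
A minimal sketch of the SQL idea using Python's built-in sqlite3: store the joint distribution of a toy two-node network (Rain -> WetGrass, hypothetical numbers) as a table, then condition and marginalize with SELECT/SUM:

```python
import sqlite3

# Toy network: Rain -> WetGrass. Store the joint P(rain, wet) as a table;
# marginalizing is a SUM, conditioning is a ratio of two sums.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE joint (rain TEXT, wet TEXT, p REAL)")
con.executemany("INSERT INTO joint VALUES (?,?,?)", [
    ("yes", "yes", 0.27), ("yes", "no", 0.03),   # P(rain)=0.3, P(wet|rain)=0.9
    ("no",  "yes", 0.07), ("no",  "no", 0.63),   # P(wet|no rain)=0.1
])

# P(rain=yes | wet=yes) = P(rain=yes, wet=yes) / P(wet=yes)
num = con.execute("SELECT p FROM joint WHERE rain='yes' AND wet='yes'").fetchone()[0]
den = con.execute("SELECT SUM(p) FROM joint WHERE wet='yes'").fetchone()[0]
print(num / den)   # ~0.794: the grass being wet makes rain much more likely
```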


---------------------------------------------

Conditional random fields (or Markov networks) turn out to be more efficient than Bayesian networks in dealing with 'holes' when extracting facts from text.


--------------------------------------


The fields of deep belief networks, multi-layer feedback neural networks, and temporal neural networks are all coming together.



--------------------------------------


reasoning = answering queries or deriving new facts (using unification + inference = resolution). (unification: e.g. the variable X bound to Obama)


resolution may never end: undecidability and intractability (cf. SAT and NP-completeness).

Luckily, OWL-DL and OWL-Lite are decidable, but still intractable in the worst case. Alternatively, Horn logic (undecidable, but tractable).


Reasoning methods: logical, as well as probabilistic (the latter can reason under uncertainty).


problems with logic: uncertainty and causality.


abductive reasoning: finding the best possible answer (the most likely causes). (Classification is a form of abductive reasoning.)


----------------


belief networks: essentially Bayesian networks and their generalizations.

The structure of the network can be learnt from the data. Belief networks bridge the fundamental limits of logic with uncertainty. (The direction of reasoning under uncertainty is PGMs (probabilistic graphical models).)

Other kinds of networks that merge logic and probability are Markov logic networks and conditional random fields.


Inference in such networks can be done with SQL.



-----------------------------------

prediction: (least-squares) linear prediction
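
A minimal least-squares sketch on toy data, using NumPy's lstsq to fit an intercept and slope:

```python
import numpy as np

# Fit y ~ w0 + w1*x by least squares (np.linalg.lstsq minimizes ||A w - y||^2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

A = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)        # ~[1.05, 1.96]: intercept and slope
print(A @ w)    # predictions at the training points
```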


---------------------------

learning parameters:

In the world of neural networks, back-propagation (which has nothing to do with feedback) is an algorithm that iteratively learns f.

In a more mathematical vein, one can use gradient descent, Newton's method, etc. (see the sketch after this list)

-- these work fine if the x are numerical values.

-- caveats: local minima; f may have some constraints (it might not necessarily be parametrized by a single set of values).

-- for categorical data

-- -- convert to binary (red, blue, green to {0,1} {0,1} {0,1}) (preferred over numerical coding, but increases the dimension)

-- -- fuzzification: convert to R^n (not preferred, as there is no reason why blue coded as 5 and red coded as 6 should be closer to each other than to the rest)

-- -- don't convert, but use neighbourhood search, heuristic search, genetic algorithms...

-- -- probabilistic models, i.e. deal with probabilities instead (becoming increasingly powerful and popular)
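
The sketch referred to above: plain gradient descent on the squared loss of the same toy linear model fitted in closed form earlier (the learning rate and iteration count are assumptions, chosen small enough to converge):

```python
import numpy as np

# Gradient descent on the squared loss f(w) = ||A w - y||^2 of a linear model.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
A = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
lr = 0.01                          # step size (must be small enough to converge)
for _ in range(5000):
    grad = 2 * A.T @ (A @ w - y)   # gradient of the squared loss
    w -= lr * grad
print(w)   # converges to the same answer as closed-form least squares
```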


------------------------------------------

Linear regression is preferred unless there is some real reason otherwise, as a complicated non-linear function may over-fit the data.


---------------------------------


HTM (hierarchical temporal memory, a neural model) is mathematically equivalent to a deep belief network (which is a probabilistic graphical model). HTM represents an area where neural networks are coming back (the architecture is uniform and able to learn a wide variety of time-series patterns).


-------------------------------------



The techniques we have seen are essentially data-driven predictions, while reasoning requires one to learn rules and reason in a symbolic way. The link between how data-driven, bottom-up techniques eventually give rise to higher-level (top-down) symbolic reasoning is still missing. (An old technique in this space is called the 'blackboard'.)


----------------------------


choosing features (those with the highest MI)

-- costly to compute exhaustively

-- proxies: IDF; iteratively -- AdaBoost, etc. (see the IDF sketch below)
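
A minimal IDF sketch on hypothetical documents, as a cheap proxy for feature informativeness: terms that occur in few documents score high, ubiquitous terms score near zero:

```python
import math

docs = [
    "the election results are in",
    "the match results thrilled fans",
    "the fans await the election",
]
tokenized = [set(d.split()) for d in docs]

# idf(t) = log( N / document-frequency(t) )
vocab = set().union(*tokenized)
idf = {t: math.log(len(docs) / sum(t in d for d in tokenized)) for t in vocab}

for t, s in sorted(idf.items(), key=lambda kv: -kv[1]):
    print(f"{t:10s} {s:.2f}")   # 'the' scores 0.00; rarer terms score higher
```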