To construct your dataset (and before doing data transformation), you should:
You might eventually use this many features, but it’s still better to start with fewer. Fewer features usually means fewer unnecessary complications.
Always consider what data is available to your model at prediction time. During training, use only the features that you’ll have available in serving, and make sure your training set is representative of your serving traffic.
The Golden Rule: Do unto training as you would do unto prediction. That is, the more closely your training task matches your prediction task, the better your ML system will perform.
When collecting data for your machine learning model, you must join together different sources to create your data set.
It is critical to use event timestamps when looking up attribute data. If you instead grab the latest user attributes, your training data will contain values from the time of data collection rather than from the time of the event, which causes training/serving skew. If you make this mistake with something like search history, you could even leak the true outcome into your training data!
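As a rough sketch of what a point-in-time join can look like (the column names, timestamps, and values here are hypothetical), pandas' `merge_asof` picks, for each event, the most recent attribute snapshot at or before the event's timestamp, never a later one:

```python
import pandas as pd

# Hypothetical events and user-attribute snapshots; column names are illustrative.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
})
attributes = pd.DataFrame({
    "user_id": [1, 1, 2],
    "snapshot_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "country": ["US", "CA", "DE"],
})

# Point-in-time join: for each event, take the latest attribute snapshot
# at or before the event timestamp, so training sees only what serving would see.
training_data = pd.merge_asof(
    events.sort_values("event_time"),
    attributes.sort_values("snapshot_time"),
    left_on="event_time",
    right_on="snapshot_time",
    by="user_id",
    direction="backward",
)
```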
Machine learning is easier when your labels are well-defined. The best label is a direct label of what you want to predict.
If you have click logs, realize that you’ll never see an impression without a click. You would need logs where the events are impressions, so you cover all cases in which a user sees a top search result.
Choose event data carefully to avoid cyclical or seasonal effects or to take those effects into account.
How do you select that subset? As an example, consider Google Search. At what granularity would you sample its massive amounts of data? Would you use random queries? Random sessions? Random users?
Ultimately, the answer depends on the problem: what do we want to predict, and what features do we want?
If your data includes PII (personally identifiable information), you may need to filter it from your data. A policy may require you to remove infrequent features, for example.
This filtering is helpful because very infrequent features are hard to learn. But it’s important to realize that your dataset will be biased toward the head queries. At serving time, you can expect to do worse on serving examples from the tail, since these were the examples that got filtered out of your training data. Although this skew can’t be avoided, be aware of it during your analysis.
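A minimal sketch of this kind of frequency filtering, assuming a hypothetical query log and an arbitrary threshold; the point is to keep track of how much tail traffic you discard:

```python
import pandas as pd

# Hypothetical query log; the frequency threshold (2 here) is arbitrary and
# would be much higher on real traffic.
logs = pd.DataFrame({"query": ["weather", "weather", "weather",
                               "news", "news", "rare query"]})

counts = logs["query"].value_counts()
head_queries = counts[counts >= 2].index            # keep only frequent ("head") queries
filtered_logs = logs[logs["query"].isin(head_queries)]

# Track how much traffic was dropped, so you can account for the tail bias
# this filtering introduces when you analyze model quality.
dropped_fraction = 1 - len(filtered_logs) / len(logs)
print(dropped_fraction)
```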
Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.
What counts as imbalanced? The answer could range from mild to extreme, as the table below shows.
Degree of Imbalance | Proportion of Minority Class |
---|---|
Mild | 20%-40% of the data set |
Moderate | 1%-20% of the data set |
Extreme | <1% of the data set |
Why would this be problematic? With so few positives relative to negatives, the model will spend most of its training on negative examples and won't learn enough from the positive ones. For example, if your batch size is 128, many batches will contain no positive examples at all, so the gradients will be less informative.
If you have an imbalanced data set, first try training on the true distribution. If the model works well and generalizes, you're done! If not, try the downsampling and upweighting technique described below.
Downsampling (in this context) means training on a disproportionately low subset of the majority class examples.
Consider again our example of the fraud data set, with 1 positive to 200 negatives. We can downsample by a factor of 20, keeping 1 in 20 of the negatives. Now about 10% of our data is positive, which will be much better for training our model.
Upweighting means adding an example weight to the downsampled class equal to the factor by which you downsampled.
Since we downsampled by a factor of 20, the example weight should be 20.
It may seem odd to add example weights after downsampling. We were trying to make our model improve on the minority class, so why upweight the majority? The reason is that upweighting restores the original base rate in the loss: the model still trains faster on the rebalanced data, but its predicted probabilities remain calibrated to the true distribution.
Whether this matters depends on whether you need the model to report calibrated probabilities. If you don't, you don't need to worry about the downsampling changing the base rate.
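A minimal sketch of downsampling with upweighting on a synthetic fraud-like dataset (the column names, seeds, and sizes are all illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical fraud dataset with a binary "label" column (1 = positive),
# roughly 1 positive per 200 negatives.
rng = np.random.default_rng(seed=17)
df = pd.DataFrame({"label": (rng.random(20_000) < 1 / 201).astype(int)})

DOWNSAMPLE_FACTOR = 20

positives = df[df["label"] == 1]
negatives = df[df["label"] == 0].sample(frac=1 / DOWNSAMPLE_FACTOR, random_state=17)

# Rebalanced training set: roughly 10% positive now.
train = pd.concat([positives, negatives]).sample(frac=1, random_state=17)

# Upweight the downsampled (majority) class by the same factor so the
# loss still reflects the original base rate and predictions stay calibrated.
train["example_weight"] = np.where(train["label"] == 0, DOWNSAMPLE_FACTOR, 1.0)
```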
After collecting your data and sampling where needed, the next step is to split your data into training sets, validation sets, and testing sets.
A frequent technique for online systems is to split the data by time: for example, collect 30 days of data, train on data from the first 29 days, and test on data from day 30, so that evaluation reflects the gap between training data and live traffic.
To design a split that is representative of your data, consider what the data represents. The golden rule applies to data splits as well: the testing task should match the production task as closely as possible.
Examples of when a random split does work:
Say you want to add a feature to see how it affects model quality. For a fair experiment, your datasets should be identical except for this new feature. If your data generation runs are not reproducible, you can’t make these datasets.
In that spirit, make sure any randomization in data generation can be made deterministic: seed your random number generators, or decide each example's split from a hash of a stable key (such as a query or user ID) rather than from a fresh random draw.
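One way to do this, sketched below with a hypothetical `split_bucket` helper, is to hash a stable key so that the same example always lands in the same split no matter how many times you regenerate the data:

```python
import hashlib

def split_bucket(key: str, train_fraction: float = 0.8) -> str:
    """Assign an example to train/test based on a stable hash of its key.

    The same key always lands in the same split, so regenerating the
    dataset (e.g., after adding a feature) never reshuffles examples.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # map the hash into [0, 1]
    return "train" if bucket < train_fraction else "test"

# Split on a stable identifier (query, user ID, ...), not on a random number:
print(split_bucket("user_12345"))
```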
Feature engineering is the process of determining which features might be useful in training a model, and then creating those features by transforming raw data found in log files and other sources.
Visualize your data frequently. Consider Anscombe’s Quartet: your data can look one way in the basic statistics, and another when graphed. Before you get too far into analysis, look at your data graphically, either via scatter plots or histograms. View graphs not only at the beginning of the pipeline, but also throughout transformation. Visualizations will help you continually check your assumptions and see the effects of any major changes.
You may need to apply two kinds of transformations to numeric data: normalization and bucketing.
Normalization is necessary if you have very different values within the same feature (for example, city population). Without normalization, your training could blow up with NaNs if the gradient update is too large.
You might have two different features with widely different ranges (e.g., age and income), causing the gradient descent to “bounce” and slow down convergence. Optimizers like Adagrad and Adam protect against this problem by creating a separate effective learning rate per feature. But optimizers can’t save you from a wide range of values within a single feature; in those cases, you must normalize.
Warning: When normalizing, ensure that the same normalizations are applied at serving time to avoid skew.
The goal of normalization is to transform features to be on a similar scale. This improves the performance and training stability of the model.
Four common normalization techniques may be useful: scaling to a range, feature clipping, log scaling, and z-score.
Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range, usually 0 to 1 (or sometimes -1 to +1). Use the following simple formula to scale to a range:
x’ = (x - x_min) / (x_max - x_min)
Scaling to a range is a good choice when both of the following conditions are met: you know the approximate upper and lower bounds of your data (with few or no outliers), and the data is approximately uniformly distributed across that range.
If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain value to a fixed value. For example, you could clip all temperature values above 40 to be exactly 40.
You may apply feature clipping before or after other normalizations.
One simple clipping strategy is to clip by z-score to ±Nσ (for example, limit to ±3σ). Note that σ is the standard deviation.
Log scaling computes the log of your values to compress a wide range to a narrow range.
x’ = log(x)
Log scaling is helpful when a handful of your values have many points, while most other values have few points. This data distribution is known as the power law distribution. Movie ratings are a good example. Most movies have very few ratings (the data in the tail), while a few have lots of ratings (the data in the head). Log scaling changes the distribution, helping to improve linear model performance.
Z-score is a variation of scaling that represents the number of standard deviations away from the mean. You would use z-score to ensure your feature distributions have mean = 0 and std = 1. It’s useful when there are a few outliers, but not so extreme that you need clipping.
The formula for calculating the z-score of a point, x, is as follows:
x’ = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
Normalization Technique | When to Use |
---|---|
Linear Scaling | When the feature is more-or-less uniformly distributed across a fixed range. |
Clipping | When the feature contains some extreme outliers. |
Log Scaling | When the feature conforms to a power law. |
Z-score | When the feature distribution does not contain extreme outliers. |
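A minimal NumPy sketch of the four techniques applied to a toy feature (the values and the ±3σ clipping threshold are illustrative):

```python
import numpy as np

# Toy feature: mostly moderate values plus a couple of extreme outliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc=50, scale=10, size=1000), [500.0, 600.0]])

# Linear scaling to [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())

# Feature clipping by z-score to +/- 3 standard deviations.
mu, sigma = x.mean(), x.std()
x_clipped = np.clip(x, mu - 3 * sigma, mu + 3 * sigma)

# Log scaling (log1p handles zeros gracefully, if any are present).
x_log = np.log1p(x)

# Z-score normalization.
x_z = (x - mu) / sigma
```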
Quantile bucketing: creating buckets so that each one contains the same number of points.
Bucketing Summary
If you choose to bucketize your numerical features, be clear about how you are setting the boundaries and which type of bucketing you're applying: equally spaced boundaries or quantile boundaries.
Bucketing with equally spaced boundaries is an easy method that works for a lot of data distributions. For skewed data, however, try bucketing with quantile bucketing.
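The difference is easy to see with pandas on a synthetic skewed feature: equal-width buckets (`pd.cut`) leave most points crowded into a few buckets, while quantile buckets (`pd.qcut`) spread them evenly. The distribution parameters below are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature (e.g., prices), drawn from a lognormal distribution.
values = pd.Series(np.random.default_rng(7).lognormal(mean=3.0, sigma=1.0, size=1_000))

# Equal-width buckets: boundaries are evenly spaced across the value range.
equal_width = pd.cut(values, bins=4)

# Quantile buckets: each bucket gets roughly the same number of points.
quantile = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```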
Some of your features may be discrete values that aren’t in an ordered relationship. Examples include breeds of dogs, words, or postal codes. These features are known as categorical and each value is called a category. You can represent categorical values as strings or even numbers, but you won’t be able to compare these numbers or subtract them from each other.
If the number of categories of a data field is small, such as the day of the week or a limited palette of colors, you can make a unique feature for each category. This sort of mapping is called a vocabulary.
In a vocabulary, each value represents a unique feature. The model looks up the index from the string, assigning 1.0 to the corresponding slot in the feature vector and 0.0 to all the other slots.
If your categories are the days of the week, you might, for example, end up representing Friday with the feature vector [0, 0, 0, 0, 1, 0, 0]. However, most implementations of ML systems will represent this vector in memory with a sparse representation. A common representation is a list of non-empty values and their corresponding indices: for example, 1.0 for the value and [4] for the index. This lets you spend less memory storing huge numbers of 0s and allows more efficient matrix multiplication. In terms of the underlying math, [4] is equivalent to [0, 0, 0, 0, 1, 0, 0].
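A small sketch of a vocabulary lookup with both representations; the day-of-week vocabulary mirrors the example above, and the helper names are hypothetical:

```python
# Hypothetical vocabulary mapping day-of-week strings to indices.
vocab = {day: i for i, day in enumerate(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])}

def one_hot(category: str) -> list[float]:
    """Dense one-hot vector: 1.0 at the category's index, 0.0 elsewhere."""
    vec = [0.0] * len(vocab)
    vec[vocab[category]] = 1.0
    return vec

def sparse(category: str) -> tuple[list[int], list[float]]:
    """Sparse representation: just the non-zero index and its value."""
    return [vocab[category]], [1.0]

print(one_hot("Friday"))   # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(sparse("Friday"))    # ([4], [1.0])
```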
Just as numerical data contains outliers, categorical data does as well. For example, a dataset of car descriptions might include a long tail of rarely seen colors. Rather than giving each rare color its own category, you can lump them into a catch-all category known as out-of-vocabulary (OOV). By using OOV, the system won't waste time training on each of those rare colors.
Another option is to hash every string (category) into your available index space. Hashing often causes collisions, but you rely on the model learning some shared representation of the categories in the same index that works well for the given problem.
For important terms, hashing can be worse than selecting a vocabulary, because of collisions. On the other hand, hashing doesn’t require you to assemble a vocabulary, which is advantageous if the feature distribution changes heavily over time.
You can take a hybrid approach and combine hashing with a vocabulary. Use a vocabulary for the most important categories in your data, but replace the OOV bucket with multiple OOV buckets, and use hashing to assign categories to buckets.
The categories in the hash buckets must share an index, so the model likely won't make good predictions for them, but we have allocated some amount of memory to attempt to learn the categories outside of our vocabulary.
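A sketch of the hybrid approach, with a tiny illustrative vocabulary and an arbitrary number of hashed OOV buckets:

```python
import hashlib

# Hypothetical setup: a small vocabulary for the important categories,
# plus several OOV buckets filled via hashing.
VOCAB = {"red": 0, "blue": 1, "green": 2}
NUM_OOV_BUCKETS = 4

def category_index(category: str) -> int:
    """Vocabulary index if known, otherwise a hashed OOV bucket."""
    if category in VOCAB:
        return VOCAB[category]
    digest = int(hashlib.md5(category.encode("utf-8")).hexdigest(), 16)
    return len(VOCAB) + digest % NUM_OOV_BUCKETS

print(category_index("blue"))        # 1 (in the vocabulary)
print(category_index("periwinkle"))  # one of the OOV buckets (index 3-6)
```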
An embedding is a categorical feature represented as a continuous-valued feature. Deep models frequently convert categorical indices into embeddings, mapping each index to a learned dense vector instead of a sparse one-hot vector.
The other transformations we’ve discussed could be stored on disk, but embeddings are different. Since embeddings are trained, they’re not a typical data transformation—they are part of the model. They’re trained with other model weights, and functionally are equivalent to a layer of weights.
What about pretrained embeddings? Pretrained embeddings are still typically modifiable during training, so they’re still conceptually part of the model.
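For concreteness, here is a minimal sketch using TensorFlow's Keras `Embedding` layer (the vocabulary size and embedding dimension are arbitrary); the layer is just a trainable weight matrix that the category index looks into:

```python
import tensorflow as tf

NUM_CATEGORIES = 1000   # vocabulary size + OOV buckets (illustrative)
EMBEDDING_DIM = 16      # chosen by experiment; this value is arbitrary

# The embedding is part of the model: a trainable weight matrix of shape
# (NUM_CATEGORIES, EMBEDDING_DIM), looked up by category index and updated
# by backprop along with the rest of the model's weights.
embedding = tf.keras.layers.Embedding(input_dim=NUM_CATEGORIES,
                                      output_dim=EMBEDDING_DIM)

category_indices = tf.constant([[4], [17], [256]])   # e.g. output of a vocabulary lookup
dense_vectors = embedding(category_indices)          # shape: (3, 1, 16)
```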