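Work has been quiet lately and I have been leaving at five on the dot, so in the evenings I keep wanting to tinker with something. One dark and windy night I had the idea of writing a small data-scraping framework that can be used on Android, built as a thin wrapper around an existing library. After comparing the options I picked jsoup (don't ask why; it is the only one I know). So I grabbed my 500-yuan mechanical keyboard and finished the framework in two evenings. (To be honest, I wrote it to scrape pictures of pretty girls.)
1. Add the jsoup library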
dependencies {
api "org.jsoup:jsoup:1.13.1"
}
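2. Define the annotations
There are two annotations. One goes on the fields of the top-level data entity class; the other goes on the fields of a nested data object.
A. For fields of the data entity class (HtmlElementField)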
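import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface HtmlElementField {
    /** Lookup types, applied in order (see JsoupConstans). */
    int[] types();

    /** The id, class name, tag name, or select expression for each lookup type. */
    String[] typenames();

    /** Whether the value is read from an attribute; false reads the element's text. */
    boolean isAttr() default false;

    /** Name of the attribute to read when isAttr is true. */
    String attrName() default "";

    /** Whether the field is a list. */
    boolean isArray() default false;

    /** Whether to search in body; false searches in head. */
    boolean isBody() default true;

    /**
     * Whether each matched element maps to an object.
     * true: the field is an object or a List of objects;
     * false: the field is a String or a List of Strings.
     */
    boolean isMultiElementData() default false;

    /** Class of the mapped object; its class name is used as a lookup key. */
    Class<?> filedModelClazz() default Object.class;
}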
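Annotation members:
types: array of lookup types. GET_ELEMENT_BY_ID finds the element by id, GET_ELEMENTS_BY_CLASS finds all elements by class name, GET_ELEMENTS_BY_TAG finds all elements by tag name, and GET_ELEMENTS_BY_ATTRVALUE finds elements with a select expression.
typenames: array of the id, class name, tag name, or select expression that goes with each entry in types.
isAttr: whether the value is read from an attribute; false means the element's text is used.
attrName: the attribute name; defaults to "".
isArray: whether the result is a list; true returns a List.
isBody: whether to search in body; true searches body, false searches head.
isMultiElementData: whether the value is an object (all of the object's fields must be Strings); true means an object, false means a String. Whether it is a list is still decided by isArray.
filedModelClazz: the class of the object; its class name is used to look up the object's field annotations.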
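B. For fields of the nested data object (HtmlElementModelKeyname)
@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface HtmlElementModelKeyname {
    /** Key used for this value in the assembled JSON. */
    String keyname();

    /** Attribute to read from the matched element; "" reads its text. */
    String attrname();
}
Annotation members:
attrname: the attribute to read from the matched element; "" reads the text, any other value reads the attribute with that name.
keyname: the key used for the value when the data is assembled.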
3. Create the class that holds one element's data (MultiElementData)
It is mainly used to generate a JSON string.
public class MultiElementData {
    public String[] keys;
    public String[] values;

    public MultiElementData(String[] keys, String[] values) {
        this.keys = keys;
        this.values = values;
    }

    /** Renders the key/value pairs as a flat JSON object. Values are written as-is, without escaping. */
    @Override
    public String toString() {
        StringBuilder builder = new StringBuilder("{");
        try {
            boolean hasTrailingComma = false;
            for (int i = 0; i < keys.length; i++) {
                builder.append("\"").append(keys[i]).append("\":");
                builder.append("\"").append(values[i]).append("\",");
                hasTrailingComma = true;
            }
            if (hasTrailingComma) {
                // Drop the comma left after the last pair
                builder.deleteCharAt(builder.length() - 1);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        builder.append("}");
        return builder.toString();
    }
}
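One thing to watch in toString() above: keys and values are written between quotes exactly as scraped, so a text containing a double quote or a backslash produces invalid JSON. A minimal sketch of an escaping step follows. `JsonEscapeSketch` and `escape` are hypothetical names that are not part of the framework, and the sketch only covers the characters that JSON requires to be escaped.

```java
/** Hypothetical helper, not part of the framework: escapes a string for use inside JSON quotes. */
class JsonEscapeSketch {
    static String escape(String raw) {
        if (raw == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                case '\r':
                    sb.append("\\r");
                    break;
                case '\t':
                    sb.append("\\t");
                    break;
                default:
                    if (c < 0x20) {
                        // Other control characters become a four-digit hex escape
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("say \"hi\""));
    }
}
```

With a helper like this, the value would be appended as escape(values[i]) instead of values[i]; the same concern applies wherever scraped text is placed between quotes.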
4. Create the classes that hold the annotation info
A. HtmlElementFieldModel
import java.lang.reflect.Field;

public class HtmlElementFieldModel {
    public Field field;
    public HtmlElementField annotation;
}
B. HtmlElementModelKeynameModel
import java.lang.reflect.Field;

public class HtmlElementModelKeynameModel {
    public Field field;
    public HtmlElementModelKeyname annotation;
}
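These holders are filled by scanning a class's declared fields for the annotation. The sketch below shows that scan in isolation. It uses a stand-in annotation (`Pick`) and a demo class (`Demo`), both invented for this example so that it compiles on its own; the framework's real scan is getFieldsAndSort() in step 7.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: Pick and Demo are stand-ins, not framework classes. */
class AnnotationScanSketch {

    @Target(ElementType.FIELD)
    @Retention(RetentionPolicy.RUNTIME) // RUNTIME retention is what makes the annotation visible to reflection
    @interface Pick {
        String name();
    }

    static class Demo {
        @Pick(name = "htitle")
        public String title;

        public String ignored; // not annotated, so the scan skips it
    }

    /** Returns "fieldName->annotationValue" for every field that carries @Pick. */
    static List<String> scan(Class<?> clazz) {
        List<String> found = new ArrayList<>();
        for (Field field : clazz.getDeclaredFields()) {
            Pick pick = field.getAnnotation(Pick.class);
            if (pick != null) {
                found.add(field.getName() + "->" + pick.name());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan(Demo.class));
    }
}
```

If the annotation were declared with the default CLASS retention instead, getAnnotation() would return null and the scan would find nothing, which is why both framework annotations are marked RUNTIME.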
5. Define the lookup types (JsoupConstans)
public class JsoupConstans {
    // Find the element by id
    public static final int GET_ELEMENT_BY_ID = 0;
    // Find elements by class name
    public static final int GET_ELEMENTS_BY_CLASS = 1;
    // Find elements by tag name
    public static final int GET_ELEMENTS_BY_TAG = 2;
    // Find elements with a select expression, for example an attribute selector
    public static final int GET_ELEMENTS_BY_ATTRVALUE = 3;
}
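Each of the first three lookup types has a CSS selector equivalent that jsoup's select() understands: `#id`, `.class`, and the bare tag name. The helper below is a hypothetical illustration of that correspondence (`SelectorSketch` and `toSelector` are not part of the framework). It joins several levels with the descendant combinator, which is close to, but not identical with, searching level by level inside each previous match.

```java
/** Hypothetical illustration: maps the JsoupConstans lookup types to CSS selector text. */
class SelectorSketch {
    static final int GET_ELEMENT_BY_ID = 0;
    static final int GET_ELEMENTS_BY_CLASS = 1;
    static final int GET_ELEMENTS_BY_TAG = 2;
    static final int GET_ELEMENTS_BY_ATTRVALUE = 3;

    /** Builds one selector from parallel arrays of lookup types and names. */
    static String toSelector(int[] types, String[] names) {
        if (types.length != names.length) {
            throw new IllegalArgumentException("types and names must have the same length");
        }
        StringBuilder selector = new StringBuilder();
        for (int i = 0; i < types.length; i++) {
            if (i > 0) {
                selector.append(' '); // descendant combinator between levels
            }
            switch (types[i]) {
                case GET_ELEMENT_BY_ID:
                    selector.append('#').append(names[i]);
                    break;
                case GET_ELEMENTS_BY_CLASS:
                    selector.append('.').append(names[i]);
                    break;
                case GET_ELEMENTS_BY_TAG:
                case GET_ELEMENTS_BY_ATTRVALUE:
                    selector.append(names[i]); // already valid selector text
                    break;
                default:
                    throw new IllegalArgumentException("unknown lookup type: " + types[i]);
            }
        }
        return selector.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSelector(
                new int[] {GET_ELEMENT_BY_ID, GET_ELEMENTS_BY_TAG},
                new String[] {"map_rank", "a"}));
    }
}
```

Names containing characters that are special in CSS would need escaping, which this sketch does not attempt.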
6. Write a jsoup helper class (JsoupUtil)
It wraps the calls to the jsoup API.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupUtil {
    private Document document;
    private String url = "";
    // 0 = online HTML, 1 = local HTML file
    private int type = 0;
    private Element body;
    private Element head;
    private String title;

    public JsoupUtil(String url) throws Exception {
        this(url, 0);
    }

    public JsoupUtil(String url, int type) throws Exception {
        this.url = url;
        this.type = type;
        if (type == 0) {
            init(url);
        } else {
            initLocal(url);
        }
    }

    private void initLocal(String localpath) throws IOException {
        File file = new File(localpath);
        document = Jsoup.parse(file, "UTF-8");
        initHtmlElement();
    }

    private void init(String url) throws Exception {
        document = Jsoup.connect(url).get();
        initHtmlElement();
    }

    /** Caches body, head, and title. */
    private void initHtmlElement() {
        head = document.head();
        body = document.body();
        title = document.title();
    }

    /** Finds a single node by id, in body or in head. */
    public Element getElementTypeById(String id, boolean isBody) {
        if (isBody) {
            return getBodyElementById(id);
        } else {
            return getHeadElementById(id);
        }
    }

    /** Finds a node by id and returns it wrapped in an Elements list. */
    public Elements getElementsTypeById(String id, boolean isBody) {
        if (isBody) {
            return getBodyElementsById(id);
        } else {
            return getHeadElementsById(id);
        }
    }

    /** Finds nodes by class name. */
    public Elements getElementsTypeByClass(String className, boolean isBody) {
        if (isBody) {
            return getBodyElementsByClass(className);
        } else {
            return getHeadElementsByClass(className);
        }
    }

    /** Finds nodes by tag name. */
    public Elements getElementsTypeByTag(String tagName, boolean isBody) {
        if (isBody) {
            return getBodyElementsByTag(tagName);
        } else {
            return getHeadElementsByTag(tagName);
        }
    }

    /** Finds nodes with a select expression. */
    public Elements getElementsBySelectStr(String selectStr, boolean isBody) {
        if (isBody) {
            return getBodyElementsBySelectStr(selectStr);
        } else {
            return getHeadElementsBySelectStr(selectStr);
        }
    }

    /** Returns every node in body or in head. */
    public Elements getAllElementsType(boolean isBody) {
        if (isBody) {
            return getBodyAllElements();
        } else {
            return getHeadAllElements();
        }
    }

    /** Finds a body node by id. */
    private Element getBodyElementById(String id) {
        if (body == null) {
            return null;
        }
        return body.getElementById(id);
    }

    /** Finds a body node by id; returns an empty list when nothing matches. */
    private Elements getBodyElementsById(String id) {
        if (body == null) {
            return null;
        }
        Element element = body.getElementById(id);
        // Avoid wrapping null, which would put a null entry in the list
        return element == null ? new Elements() : new Elements(element);
    }

    /** Finds body nodes by class name. */
    private Elements getBodyElementsByClass(String className) {
        if (body == null) {
            return null;
        }
        return body.getElementsByClass(className);
    }

    /** Finds body nodes by tag name. */
    private Elements getBodyElementsByTag(String tagName) {
        if (body == null) {
            return null;
        }
        return body.getElementsByTag(tagName);
    }

    /** Finds body nodes with a select expression. */
    private Elements getBodyElementsBySelectStr(String selectStr) {
        if (body == null) {
            return null;
        }
        return body.select(selectStr);
    }

    /** Returns every body node. */
    private Elements getBodyAllElements() {
        if (body == null) {
            return null;
        }
        return body.getAllElements();
    }

    /** Finds a head node by id. */
    private Element getHeadElementById(String id) {
        if (head == null) {
            return null;
        }
        return head.getElementById(id);
    }

    /** Finds a head node by id; returns an empty list when nothing matches. */
    private Elements getHeadElementsById(String id) {
        if (head == null) {
            return null;
        }
        Element element = head.getElementById(id);
        // Avoid wrapping null, which would put a null entry in the list
        return element == null ? new Elements() : new Elements(element);
    }

    /** Finds head nodes by class name. */
    private Elements getHeadElementsByClass(String className) {
        if (head == null) {
            return null;
        }
        return head.getElementsByClass(className);
    }

    /** Finds head nodes by tag name. */
    private Elements getHeadElementsByTag(String tagName) {
        if (head == null) {
            return null;
        }
        return head.getElementsByTag(tagName);
    }

    /** Finds head nodes with a select expression. */
    private Elements getHeadElementsBySelectStr(String selectStr) {
        if (head == null) {
            return null;
        }
        return head.select(selectStr);
    }

    /** Returns every head node. */
    private Elements getHeadAllElements() {
        if (head == null) {
            return null;
        }
        return head.getAllElements();
    }

    /** Drops the references to the parsed document. */
    public void release() {
        document = null;
        body = null;
        head = null;
        System.gc();
    }
}
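7. Tie the annotations and the jsoup helper together (JsoupManager)
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.jsoup.nodes.Element;

// Assumes a static import of the JsoupConstans constants.
public class JsoupManager<T> {
    private Class<T> clazz;
    private static final String TAG = "JsoupManager";
    private Field[] allFields;
    private List<HtmlElementFieldModel> fieldModels;
    // Keyed by nested class name: the attribute names and JSON keys of its fields
    HashMap<String, String[]> multiElementAttrsHashmap = new HashMap<>();
    HashMap<String, String[]> multiElementKeysHashmap = new HashMap<>();

    public JsoupManager(Class<T> clazz, Class<?>... otherClazz) {
        this.clazz = clazz;
        // Collect all declared fields
        allFields = clazz.getDeclaredFields();
        fieldModels = getFieldsAndSort();
        if (otherClazz != null) {
            initMultiAttrAndNames(otherClazz);
        }
    }

    public JsoupManager(Class<T> clazz) {
        this(clazz, (Class<?>[]) null);
    }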

    public T getDataByUrl(String url) {
        T result = null;
        SystemLogUtil.printSysLog("getDataByUrl", url);
        JsoupUtil jsoupUtil = null;
        try {
            jsoupUtil = new JsoupUtil(url);
            StringBuffer buffer = new StringBuffer("{");
            boolean isNeedDeletePoint = false;
            for (int i = 0; i < fieldModels.size(); i++) {
                HtmlElementFieldModel htmlElementFieldModel = fieldModels.get(i);
                // JSON key: the field name
                String fieldname = htmlElementFieldModel.field.getName();
                buffer.append("\"").append(fieldname).append("\":");
                int[] types = htmlElementFieldModel.annotation.types();
                String[] names = htmlElementFieldModel.annotation.typenames();
                // Whether the value comes from an attribute, and which one
                boolean isAttr = htmlElementFieldModel.annotation.isAttr();
                String attrName = htmlElementFieldModel.annotation.attrName();
                // Whether the field is a list
                boolean isArray = htmlElementFieldModel.annotation.isArray();
                boolean isBody = htmlElementFieldModel.annotation.isBody();
                boolean isMultiData = htmlElementFieldModel.annotation.isMultiElementData();
                Class<?> filedModelClazz = htmlElementFieldModel.annotation.filedModelClazz();
                ArrayList<Element> elements = getElements(jsoupUtil, types, names, isBody);
                String className = filedModelClazz.getName();
                String[] multiElementAttrs = multiElementAttrsHashmap.get(className);
                String[] multiElementKeys = multiElementKeysHashmap.get(className);
                // JSON value; each append helper leaves a trailing comma
                if (!isArray) {
                    if (!isMultiData) {
                        appendValues(buffer, isAttr, attrName, elements, 0);
                    } else {
                        appendMultiData(buffer, multiElementAttrs, multiElementKeys, elements, 0);
                    }
                    isNeedDeletePoint = true;
                } else {
                    buffer.append("[");
                    boolean isNeedDeletePointInside = false;
                    for (int j = 0; j < elements.size(); j++) {
                        if (!isMultiData) {
                            appendValues(buffer, isAttr, attrName, elements, j);
                        } else {
                            appendMultiData(buffer, multiElementAttrs, multiElementKeys, elements, j);
                        }
                        isNeedDeletePointInside = true;
                    }
                    if (isNeedDeletePointInside) {
                        buffer.deleteCharAt(buffer.length() - 1);
                    }
                    buffer.append("]");
                    buffer.append(",");
                    isNeedDeletePoint = (i == fieldModels.size() - 1);
                }
            }
            if (isNeedDeletePoint) {
                // Drop the comma after the last field
                buffer.deleteCharAt(buffer.length() - 1);
            }
            buffer.append("}");
            result = GsonUtils.getInstance().getEntetyByString(buffer.toString(), clazz);
        } catch (Exception e) {
            e.printStackTrace();
            // getMessage() may be null, so do not call toString() on it
            SystemLogUtil.printSysLog("JsoupManager", String.valueOf(e.getMessage()));
        } finally {
            // Release the parsed document
            if (jsoupUtil != null) {
                jsoupUtil.release();
            }
        }
        return result;
    }

    /** Appends element j as a JSON object, followed by a comma. */
    private void appendMultiData(StringBuffer buffer, String[] multiElementAttrs, String[] multiElementKeys, ArrayList<Element> elements, int j) throws Exception {
        String[] multiElementValues = new String[multiElementAttrs.length];
        for (int i = 0; i < multiElementAttrs.length; i++) {
            String multiElementAttr = multiElementAttrs[i];
            // Compare by content; == on strings compares references
            if (multiElementAttr.isEmpty()) {
                multiElementValues[i] = elements.get(j).text();
            } else {
                multiElementValues[i] = elements.get(j).attr(multiElementAttr);
            }
        }
        MultiElementData multiElementData = new MultiElementData(multiElementKeys, multiElementValues);
        buffer.append(multiElementData.toString());
        buffer.append(",");
    }

    /** Appends element j's text or attribute as a JSON string, followed by a comma. */
    private void appendValues(StringBuffer buffer, boolean isAttr, String attrName, ArrayList<Element> elements, int j) {
        String value = "";
        try {
            value = isAttr ? elements.get(j).attr(attrName) : elements.get(j).text();
        } catch (Exception e) {
            // No such element: fall back to an empty string
            e.printStackTrace();
        }
        buffer.append("\"");
        buffer.append(value);
        buffer.append("\"");
        buffer.append(",");
    }

    /** Finds all target elements; returns an empty list if the lookup fails. */
    private ArrayList<Element> getElements(JsoupUtil jsoupUtil, int[] types, String[] names, boolean isBody) {
        try {
            if (types.length > 1) {
                return getElementsMulti(jsoupUtil, types, names, isBody);
            } else {
                return getElementsSingle(jsoupUtil, types[0], names[0], isBody);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new ArrayList<>();
    }

    /** Multi-level lookup: each level searches inside the matches of the previous one. */
    private ArrayList<Element> getElementsMulti(JsoupUtil jsoupUtil, int[] types, String[] names, boolean isBody) throws Exception {
        ArrayList<Element> temp = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            dealTempData(temp, jsoupUtil, types[i], names[i], i == 0, isBody);
        }
        return temp;
    }

    private void dealTempData(ArrayList<Element> temp, JsoupUtil jsoupUtil, int type, String name, boolean isFirst, boolean isBody) throws Exception {
        if (isFirst) {
            // First level: search the document's body or head
            temp.addAll(getElementsSingle(jsoupUtil, type, name, isBody));
            return;
        }
        // Later levels: search inside each element matched so far
        ArrayList<Element> previous = new ArrayList<>(temp);
        temp.clear();
        for (Element parent : previous) {
            if (type == GET_ELEMENT_BY_ID) {
                Element match = parent.getElementById(name);
                if (match != null) {
                    temp.add(match);
                }
            } else if (type == GET_ELEMENTS_BY_CLASS) {
                temp.addAll(parent.getElementsByClass(name));
            } else if (type == GET_ELEMENTS_BY_TAG) {
                temp.addAll(parent.getElementsByTag(name));
            } else if (type == GET_ELEMENTS_BY_ATTRVALUE) {
                temp.addAll(parent.select(name));
            }
        }
    }

    /** Single-level lookup. */
    private ArrayList<Element> getElementsSingle(JsoupUtil jsoupUtil, int type, String name, boolean isBody) throws Exception {
        ArrayList<Element> result = new ArrayList<>();
        if (type == GET_ELEMENT_BY_ID) {
            result.addAll(jsoupUtil.getElementsTypeById(name, isBody));
        } else if (type == GET_ELEMENTS_BY_CLASS) {
            result.addAll(jsoupUtil.getElementsTypeByClass(name, isBody));
        } else if (type == GET_ELEMENTS_BY_TAG) {
            result.addAll(jsoupUtil.getElementsTypeByTag(name, isBody));
        } else if (type == GET_ELEMENTS_BY_ATTRVALUE) {
            result.addAll(jsoupUtil.getElementsBySelectStr(name, isBody));
        }
        return result;
    }

    /** Collects the fields annotated with HtmlElementField, in the order getDeclaredFields() returns them (no sorting is done). */
    private List<HtmlElementFieldModel> getFieldsAndSort() {
        List<HtmlElementFieldModel> outputFieldModels = new ArrayList<>();
        for (Field field : allFields) {
            if (field.isAnnotationPresent(HtmlElementField.class)) {
                HtmlElementFieldModel outputFieldModel = new HtmlElementFieldModel();
                outputFieldModel.field = field;
                outputFieldModel.annotation = field.getAnnotation(HtmlElementField.class);
                outputFieldModels.add(outputFieldModel);
            }
        }
        return outputFieldModels;
    }

    /** Collects the fields annotated with HtmlElementModelKeyname (no sorting is done). */
    private List<HtmlElementModelKeynameModel> getMdFieldsAndSort(Field[] allMdFields) {
        List<HtmlElementModelKeynameModel> outputFieldModels = new ArrayList<>();
        for (Field field : allMdFields) {
            if (field.isAnnotationPresent(HtmlElementModelKeyname.class)) {
                HtmlElementModelKeynameModel outputFieldModel = new HtmlElementModelKeynameModel();
                outputFieldModel.field = field;
                outputFieldModel.annotation = field.getAnnotation(HtmlElementModelKeyname.class);
                outputFieldModels.add(outputFieldModel);
            }
        }
        return outputFieldModels;
    }

    /** Records, for each nested class, the attribute names and JSON keys declared on its fields. */
    private void initMultiAttrAndNames(Class<?>[] otherClazz) {
        for (int index = 0; index < otherClazz.length; index++) {
            try {
                Class<?> tempClazz = otherClazz[index];
                String className = tempClazz.getName();
                List<HtmlElementModelKeynameModel> fieldMdModels = getMdFieldsAndSort(tempClazz.getDeclaredFields());
                String[] attrs = new String[fieldMdModels.size()];
                String[] keys = new String[fieldMdModels.size()];
                for (int i = 0; i < fieldMdModels.size(); i++) {
                    attrs[i] = fieldMdModels.get(i).annotation.attrname();
                    keys[i] = fieldMdModels.get(i).annotation.keyname();
                }
                multiElementAttrsHashmap.put(className, attrs);
                multiElementKeysHashmap.put(className, keys);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
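Much of the bookkeeping in getDataByUrl() (the isNeedDeletePoint flags and the deleteCharAt calls) exists only to drop trailing commas. One possible simplification is `java.util.StringJoiner`, which puts the separator only between items. The sketch below is standalone: `JsonJoinSketch`, `quote`, `array`, and `object` are hypothetical names, and escaping is left out to keep it short. StringJoiner is a Java 8 class, so on Android its availability depends on the API level or the desugaring setup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical sketch: builds JSON text without trailing-comma cleanup. Values are not escaped. */
class JsonJoinSketch {
    static String quote(String s) {
        return "\"" + s + "\"";
    }

    /** Renders a list of strings as a JSON array. */
    static String array(List<String> values) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (String value : values) {
            joiner.add(quote(value));
        }
        return joiner.toString();
    }

    /** Renders parallel arrays of keys and already-rendered JSON values as a JSON object. */
    static String object(String[] keys, String[] jsonValues) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (int i = 0; i < keys.length; i++) {
            joiner.add(quote(keys[i]) + ":" + jsonValues[i]);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String json = object(
                new String[] {"title", "locationName"},
                new String[] {quote("demo"), array(Arrays.asList("a", "b"))});
        System.out.println(json);
    }
}
```

An empty list renders as `[]` with no special case, which is the situation the isNeedDeletePointInside flag guards against.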
------------------------------ a fancy divider ------------------------------
Now for how to use it.
1. Write the data entity class (example: MenuNetModel)
import java.util.List;

// Assumes a static import of the JsoupConstans constants.
public class MenuNetModel {
    /** Text of the element with id htitle (first match). */
    @HtmlElementField(types = GET_ELEMENT_BY_ID, typenames = "htitle")
    public String title;

    /** Value of the class attribute of the element with id menu-second-navi (first match). */
    @HtmlElementField(types = GET_ELEMENT_BY_ID, typenames = "menu-second-navi", isAttr = true, attrName = "class")
    public String titleTemp;

    /** Text of the first element matching [href=/g/35261/] inside elements with class caption. */
    @HtmlElementField(types = {GET_ELEMENTS_BY_CLASS, GET_ELEMENTS_BY_ATTRVALUE}, typenames = {"caption", "[href=/g/35261/]"})
    public String attrText;

    /** Texts of all elements matching [href=/g/35261/] inside elements with class caption. */
    @HtmlElementField(types = {GET_ELEMENTS_BY_CLASS, GET_ELEMENTS_BY_ATTRVALUE}, typenames = {"caption", "[href=/g/35261/]"}, isArray = true)
    public List<String> attrTexts;

    /** Text of the first element matching [href=/g/35261/]. */
    @HtmlElementField(types = GET_ELEMENTS_BY_ATTRVALUE, typenames = "[href=/g/35261/]")
    public String attrText1;

    /** Texts of all elements matching [href=/g/35261/]. */
    @HtmlElementField(types = GET_ELEMENTS_BY_ATTRVALUE, typenames = "[href=/g/35261/]", isArray = true)
    public List<String> attrTexts1;

    /** Texts of the a tags under the element with id map_rank. */
    @HtmlElementField(types = {GET_ELEMENT_BY_ID, GET_ELEMENTS_BY_TAG}, typenames = {"map_rank", "a"}, isArray = true)
    public List<String> locationName;

    /** href values of the a tags under the element with id map_rank. */
    @HtmlElementField(types = {GET_ELEMENT_BY_ID, GET_ELEMENTS_BY_TAG}, typenames = {"map_rank", "a"}, isArray = true, isAttr = true, attrName = "href")
    public List<String> locationPath;

    /** Objects built from the a tags under elements with class tag_div. */
    @HtmlElementField(types = {GET_ELEMENTS_BY_CLASS, GET_ELEMENTS_BY_TAG}, typenames = {"tag_div", "a"}, isArray = true,
            isMultiElementData = true, filedModelClazz = ElementData.class)
    public List<ElementData> data;

    /** The same a tags, mapped to a second object type. */
    @HtmlElementField(types = {GET_ELEMENTS_BY_CLASS, GET_ELEMENTS_BY_TAG}, typenames = {"tag_div", "a"}, isArray = true,
            isMultiElementData = true, filedModelClazz = ElementDataTemp.class)
    public List<ElementDataTemp> dataTemp;

    public static class ElementData {
        /** Reads the element's text. */
        @HtmlElementModelKeyname(attrname = "", keyname = "name")
        public String name;

        /** Reads the value of the href attribute. */
        @HtmlElementModelKeyname(attrname = "href", keyname = "path")
        public String path;

        @Override
        public String toString() {
            return "ElementData{" +
                    "name='" + name + '\'' +
                    ", path='" + path + '\'' +
                    '}';
        }
    }

    public static class ElementDataTemp {
        @HtmlElementModelKeyname(attrname = "", keyname = "name1")
        public String name1;

        @HtmlElementModelKeyname(attrname = "href", keyname = "path1")
        public String path1;

        @Override
        public String toString() {
            return "ElementDataTemp{" +
                    "name1='" + name1 + '\'' +
                    ", path1='" + path1 + '\'' +
                    '}';
        }
    }

    @Override
    public String toString() {
        return "MenuNetModel{" +
                "title='" + title + '\'' +
                ", titleTemp='" + titleTemp + '\'' +
                ", attrText='" + attrText + '\'' +
                ", attrTexts=" + attrTexts +
                ", attrText1='" + attrText1 + '\'' +
                ", attrTexts1=" + attrTexts1 +
                ", locationName=" + locationName +
                ", locationPath=" + locationPath +
                ", data=" + data +
                ", dataTemp=" + dataTemp +
                '}';
    }
}
2. Call JsoupManager
A. If any field has isMultiElementData = true, pass the nested classes as well
JsoupManager<MenuNetModel> jsoupManager = new JsoupManager<>(MenuNetModel.class, MenuNetModel.ElementData.class, MenuNetModel.ElementDataTemp.class);
MenuNetModel dataByUrl = jsoupManager.getDataByUrl("fullUrl");
B. Otherwise
JsoupManager<MenuNetModel> manager = new JsoupManager<>(MenuNetModel.class);
MenuNetModel model = manager.getDataByUrl("fullUrl");
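getDataByUrl() blocks on a network request, and Android does not allow network access on the main thread (it throws NetworkOnMainThreadException), so the call has to run on a worker thread. A minimal sketch with a plain `ExecutorService` follows. `BackgroundFetchSketch` and `fetchAsync` are hypothetical names, and the `Callable` stands in for a call such as `() -> manager.getDataByUrl(url)`; handing the result back to the UI thread is not shown.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: runs a blocking fetch on a worker thread. */
class BackgroundFetchSketch {
    private static final ExecutorService WORKER = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "jsoup-worker");
        thread.setDaemon(true); // lets this demo exit without an explicit shutdown
        return thread;
    });

    /** Submits the blocking task and returns a Future for its result. */
    static <T> Future<T> fetchAsync(Callable<T> blockingFetch) {
        return WORKER.submit(blockingFetch);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for: () -> manager.getDataByUrl("fullUrl")
        Future<String> future = fetchAsync(() -> "parsed model");
        // get() blocks until the task finishes, so it must not be called on the UI thread
        System.out.println(future.get());
    }
}
```

Because getDataByUrl() returns null when the request or the parsing fails, the code that consumes the result should check for null.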
3. The test HTML
[Test page markup not preserved. Recoverable text: page title "美女图片_宅男女神", heading "美女图片", followed by a list of 20 items.]
That completes this simple wrapper around jsoup. If you have suggestions for improving it, I would be glad to hear them.