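Suppose you have a data file in XML format, and some of its tags are empty. The requirement is to parse the XML with Hive and set every empty value to a default.
There are several ways to parse XML data into a Hive table. One is to add the hivexmlserde JAR and specify its SerDe in the ROW FORMAT clause. Another is to load the XML file as a single string into a Hive staging table and then use XPath to pull out the data for each tag.
Save the following XML data on your local system: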
<Company><Employee><Id>458790</Id><Name>Sameer</Name><Email>[email protected]</Email><Address><HouseNo>105</HouseNo><Street>Grand Road</Street><City>Bangalore</City><State>Karnataka</State><Pincode>560068</Pincode><Country>India</Country></Address><Passport>Available</Passport><Visa /><Contact><Mobile>9909999999</Mobile><Phone>8044552266</Phone></Contact></Employee><Employee><Id>458791</Id><Name>Gohar</Name><Email>[email protected]</Email><Address><HouseNo>485</HouseNo><Street>Camac Street Road</Street><City>Mumbai</City><State>Maharastra</State><Pincode>400001</Pincode><Country>India</Country></Address><Passport>Available</Passport><Visa /><Contact><Mobile>9908888888</Mobile><Phone /></Contact></Employee></Company>
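Example
Create table
CREATE EXTERNAL TABLE companyxml(xmldata STRING) LOCATION '/user/hive/companyxml/';
In this step we create a staging table that stores the XML data as a single record. LOCATION must name the directory that holds the XML file (company.xml in this example), not the file itself, and the whole document has to sit on one line so that Hive reads it as one row.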
Create view
CREATE VIEW companyview
(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
xpath(xmldata,'Company/Employee/Address/Street/text()'),
xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
xpath(xmldata,'Company/Employee/Address/Country/text()'),
xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmldata,'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmldata,'Company/Employee/Contact/Phone/text()')
FROM companyxml;
Here I create a view named companyview, which uses XPath to extract the value of each tag from the staging table.
Let's check the data in the view:
Query data
SELECT * FROM companyview;
["458790","458791"] ["Sameer","Gohar"] ["[email protected]","[email protected]"] ["105","485"] ["Grand Road","Camac Street Road"] ["Bangalore","Mumbai"] ["Karnataka","Maharastra"] ["560068","400001"] ["India","India"] ["Available","Available"] [] ["9909999999","9908888888"] ["8044552266"]
Time taken: 0.41 seconds, Fetched: 1 row(s)
In the output above, the phone column holds only one value because only one employee has a phone number, and the visa column is an empty array because both Visa tags are empty.
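The empty array can be reproduced outside Hive. The sketch below is only an illustration: it uses the JDK's javax.xml.xpath API rather than Hive's xpath() function, on a trimmed-down copy of the sample data, and the class name EmptyTagXPathDemo and the helper countMatches are made up for this example. It shows that an empty element has no text node, so text() has nothing to return for it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EmptyTagXPathDemo {
    // Trimmed-down sample: two employees, both without a visa, one without a phone.
    public static final String XML =
        "<Company>"
        + "<Employee><Visa /><Contact><Phone>8044552266</Phone></Contact></Employee>"
        + "<Employee><Visa /><Contact><Phone /></Contact></Employee>"
        + "</Company>";

    // Returns how many text nodes the XPath expression matches in the document.
    public static int countMatches(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Both Visa elements are empty, so no text node matches.
        System.out.println(countMatches(XML, "Company/Employee/Visa/text()"));          // prints 0
        // Only the first employee has phone text.
        System.out.println(countMatches(XML, "Company/Employee/Contact/Phone/text()")); // prints 1
    }
}
```

Because the missing values are dropped rather than kept as placeholders, the arrays for different tags can end up with different lengths, which is what the view output shows.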
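As the output shows, xpath() returns an array of values for each column. To assign a default value to the empty tags, we need to write a custom Hive UDF.
The problem appears whenever the XML data file contains empty tags. The fix is to modify the XML before passing it to XPath and give each empty tag a value such as blank or NULL. In this example, <Phone /> is replaced with <Phone>NULL</Phone>.
Create a Maven project in Eclipse and add a Java class named XmlEmptyParse. The Java code is below.
UDF Java code: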
// The package must match the class name used in CREATE TEMPORARY FUNCTION ('hive.udf.XmlEmptyParse').
package hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

// Rewrites every self-closing tag (<Tag/> or <Tag />) as <Tag>NULL</Tag>,
// so that text() has a value to return for it. Assumes such tags have no attributes.
public class XmlEmptyParse extends UDF {
    private static final String REPLACE_VALUE = "NULL";

    public String evaluate(String xmlData) {
        if (xmlData == null) {
            return null;
        }
        String xml = xmlData;
        int end;
        while ((end = xml.indexOf("/>")) >= 0) {
            // Find the "<" that opens the self-closing tag ending at "end".
            int start = xml.lastIndexOf('<', end);
            if (start < 0) {
                break; // malformed input: "/>" without an opening "<"
            }
            String name = xml.substring(start + 1, end).trim();
            xml = xml.substring(0, start)
                    + "<" + name + ">" + REPLACE_VALUE + "</" + name + ">"
                    + xml.substring(end + 2);
        }
        return xml;
    }
}
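As an alternative to the index-based loop above, the same substitution can be sketched with a regular expression. This is only an illustration, not the author's UDF: the class name EmptyTagRegexSketch and the method fillEmptyTags are hypothetical, it runs without Hive on the classpath, and it assumes self-closing tags carry no attributes or namespace prefixes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyTagRegexSketch {
    // Matches a self-closing tag without attributes, e.g. <Phone/> or <Phone />.
    // Group 1 captures the tag name.
    private static final Pattern SELF_CLOSING =
        Pattern.compile("<([A-Za-z_][\\w.-]*)\\s*/>");

    // Replaces every self-closing tag with an open/close pair wrapping defaultValue.
    public static String fillEmptyTags(String xml, String defaultValue) {
        if (xml == null) {
            return null;
        }
        // quoteReplacement keeps "$" or "\" in the default value from being
        // interpreted as regex replacement syntax.
        String replacement = "<$1>" + Matcher.quoteReplacement(defaultValue) + "</$1>";
        return SELF_CLOSING.matcher(xml).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String xml = "<Contact><Mobile>9908888888</Mobile><Phone /></Contact>";
        // prints <Contact><Mobile>9908888888</Mobile><Phone>NULL</Phone></Contact>
        System.out.println(fillEmptyTags(xml, "NULL"));
    }
}
```

Passing the default value as a parameter makes it easy to try values other than NULL; wrapping this logic in a UDF would still require the Hive dependency and the evaluate method shown above.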
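Once the code is done, export the Maven project as a JAR file and save it locally. I exported mine as XmlParseUdf-0.0.1-SNAPSHOT.jar. That completes the UDF code; next we use this JAR to call the UDF.
First, add the JAR to Hive with the following command:
ADD JAR [created_jar_location]
In my case, that is ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;
Add the JAR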
hive> ADD JAR /home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar;
Added [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar] to class path
Added resources: [/home/NN/HadoopRepo/Hive/udf/XmlParseUdf-0.0.1-SNAPSHOT.jar]
Next, to use the custom UDF, we need to create a temporary function. Give it a self-explanatory name; here I call it xmlUDF.
Temporary function
hive> CREATE TEMPORARY FUNCTION xmlUDF AS 'hive.udf.XmlEmptyParse';
OK
Time taken: 0.802 seconds
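The UDF is now built and registered as a temporary function. In this step we put that function to use. Because companyview already exists from the earlier step, drop it first (DROP VIEW companyview;), otherwise the CREATE VIEW statement fails.
Use the function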
hive> CREATE VIEW companyview(id,name,email,houseno,street,city,state,pincode,country,passport,visa,mobile,phone)
AS SELECT
xpath(xmldata,'Company/Employee/Id/text()'),
xpath(xmldata,'Company/Employee/Name/text()'),
xpath(xmldata,'Company/Employee/Email/text()'),
xpath(xmldata,'Company/Employee/Address/HouseNo/text()'),
xpath(xmldata,'Company/Employee/Address/Street/text()'),
xpath(xmldata,'Company/Employee/Address/City/text()'),
xpath(xmldata,'Company/Employee/Address/State/text()'),
xpath(xmldata,'Company/Employee/Address/Pincode/text()'),
xpath(xmldata,'Company/Employee/Address/Country/text()'),
xpath(xmldata,'Company/Employee/Passport/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Visa/text()'),
xpath(xmldata,'Company/Employee/Contact/Mobile/text()'),
xpath(xmlUDF(xmldata),'Company/Employee/Contact/Phone/text()')
FROM companyxml;
OK
Time taken: 0.727 seconds
Here I apply the UDF to Visa and Phone, because both contain empty tags. Now check the data the view returns.
hive> SELECT * FROM companyview;
OK
["458790","458791"] ["Sameer","Gohar"] ["[email protected]","[email protected]"] ["105","485"] ["Grand Road","Camac Street Road"] ["Bangalore","Mumbai"] ["Karnataka","Maharastra"] ["560068","400001"] ["India","India"] ["Available","Available"] ["NULL","NULL"] ["9909999999","9908888888"] ["8044552266","NULL"]
Time taken: 1.184 seconds, Fetched: 1 row(s)
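The before/after difference can also be checked locally without a Hive session. The following is a rough sketch under stated assumptions: it uses the JDK's XPath API instead of Hive's xpath(), a one-line regex stand-in for the UDF's substitution, and a trimmed-down copy of the sample data. The names DefaultedXPathCheck, fillEmptyTags and values are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DefaultedXPathCheck {
    // Trimmed-down sample: the second employee has an empty Phone tag.
    public static final String SAMPLE = "<Company>"
        + "<Employee><Contact><Mobile>9909999999</Mobile><Phone>8044552266</Phone></Contact></Employee>"
        + "<Employee><Contact><Mobile>9908888888</Mobile><Phone /></Contact></Employee>"
        + "</Company>";

    // Stand-in for the UDF's substitution: <Tag /> becomes <Tag>NULL</Tag>.
    public static String fillEmptyTags(String xml) {
        return xml.replaceAll("<(\\w+)\\s*/>", "<$1>NULL</$1>");
    }

    // Returns the text of every node matched by the expression.
    public static List<String> values(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String path = "Company/Employee/Contact/Phone/text()";
        System.out.println(values(SAMPLE, path));                // prints [8044552266]
        System.out.println(values(fillEmptyTags(SAMPLE), path)); // prints [8044552266, NULL]
    }
}
```

After the substitution the Phone list has one entry per employee, so its positions line up with the other per-employee lists, which matches the second view output.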
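You have now seen how to handle an XML data file that contains one or more empty tags. The same approach works for assigning any default value to an empty tag.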