Spark SQL itself provides no security authentication mechanism. On our current cluster, security is handled by two components, Sentry and Ranger. When a table is created through Spark SQL, Sentry rejects the operation with 'org.apache.hadoop.hive.metastore.api.MetaException: User xxx does not have privileges for CREATETABLE', yet the same statement succeeds through Hive's beeline.
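A minimal reproduction sketch, assuming a Sentry-protected Hive metastore and a user who holds CREATE on the database but no grant on the table's future HDFS location (database and table names are hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sentry-createtable-repro")
  .enableHiveSupport()
  .getOrCreate()

// Fails through Spark SQL with "User xxx does not have privileges for
// CREATETABLE", while the same statement succeeds through beeline.
spark.sql("CREATE TABLE testdb.t1 (id INT, name STRING)")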
By analyzing the Sentry logs, we located the code where Sentry validates privileges:
public void authorize(HiveOperation hiveOp, HiveAuthzPrivileges stmtAuthPrivileges,
    Subject subject, List<List<DBModelAuthorizable>> inputHierarchyList,
    List<List<DBModelAuthorizable>> outputHierarchyList)
    throws AuthorizationException {
  if (!open) {
    throw new IllegalStateException("Binding has been closed");
  }
  boolean isDebug = LOG.isDebugEnabled();
  if (isDebug) {
    LOG.debug("Going to authorize statement " + hiveOp.name() +
        " for subject " + subject.getName());
  }
  // Check read entities
  Map<AuthorizableType, EnumSet<DBModelAction>> requiredInputPrivileges =
      stmtAuthPrivileges.getInputPrivileges();
  if (isDebug) {
    LOG.debug("requiredInputPrivileges = " + requiredInputPrivileges);
    LOG.debug("inputHierarchyList = " + inputHierarchyList);
  }
  Map<AuthorizableType, EnumSet<DBModelAction>> requiredOutputPrivileges =
      stmtAuthPrivileges.getOutputPrivileges();
  if (isDebug) {
    LOG.debug("requiredOutputPrivileges = " + requiredOutputPrivileges);
    LOG.debug("outputHierarchyList = " + outputHierarchyList);
  }
  LOG.info("user: {}, required input hierarchy: {}, required output hierarchy: {}",
      new Object[]{subject.getName(), inputHierarchyList, outputHierarchyList});
  boolean found = false;
  for (Map.Entry<AuthorizableType, EnumSet<DBModelAction>> entry : requiredInputPrivileges.entrySet()) {
    AuthorizableType key = entry.getKey();
    for (List<DBModelAuthorizable> inputHierarchy : inputHierarchyList) {
      if (getAuthzType(inputHierarchy).equals(key)) {
        found = true;
        if (!authProvider.hasAccess(subject, inputHierarchy, entry.getValue(), activeRoleSet)) {
          throw new AuthorizationException("User " + subject.getName() +
              " does not have privileges for " + hiveOp.name());
        }
      }
    }
    if (!found && !key.equals(AuthorizableType.URI) && !(hiveOp.equals(HiveOperation.QUERY))
        && !(hiveOp.equals(HiveOperation.CREATETABLE_AS_SELECT))) {
      throw new AuthorizationException("Required privilege( " + key.name() + ") not available in input privileges");
    }
    found = false;
  }
  // Check write entities
  for (Map.Entry<AuthorizableType, EnumSet<DBModelAction>> entry : requiredOutputPrivileges.entrySet()) {
    AuthorizableType key = entry.getKey();
    for (List<DBModelAuthorizable> outputHierarchy : outputHierarchyList) {
      if (getAuthzType(outputHierarchy).equals(key)) {
        found = true;
        if (!authProvider.hasAccess(subject, outputHierarchy, entry.getValue(), activeRoleSet)) {
          throw new AuthorizationException("User " + subject.getName() +
              " does not have privileges for " + hiveOp.name());
        }
      }
    }
    if (!found && !(key.equals(AuthorizableType.URI)) && !(hiveOp.equals(HiveOperation.QUERY))) {
      throw new AuthorizationException("Required privilege( " + key.name() + ") not available in output privileges");
    }
    found = false;
  }
}
The method's parameters are:
hiveOp: the operation type of the current SQL statement
stmtAuthPrivileges: the set of privileges required by this operation
subject: the current user
inputHierarchyList and outputHierarchyList: the input objects and output objects, i.e. the input and output resources this SQL statement needs to access
Authorization of the user proceeds in two steps:
1. Check whether the user holds this operation's required privileges on the input object list.
2. Check whether the user holds this operation's required privileges on the output object list.
stmtAuthPrivileges holds one privilege map for input objects and one for output objects. Each map is keyed by an AuthorizableType enum value, which is one of Server, Db, Table, Column, View, or URI. For every AuthorizableType key there must be at least one entry in inputHierarchyList or outputHierarchyList whose authzType matches it; for each matching hierarchy, the provider's hasAccess method then decides whether the user holds the required privileges on that object list.
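As a concrete illustration of this matching, here is a sketch of what the required-privilege map and an object hierarchy might look like for CREATE TABLE; the classes are simplified stand-ins for Sentry's own types, and the server/database names are hypothetical:
// Simplified stand-ins for Sentry's model classes, for illustration only.
sealed trait AuthorizableType
case object Server extends AuthorizableType
case object Db extends AuthorizableType
case object URI extends AuthorizableType

case class Authorizable(authzType: AuthorizableType, name: String)

// A plausible output hierarchy for CREATE TABLE testdb.t1: the hierarchy's
// most specific element is the database the table will live in.
val outputHierarchy: List[Authorizable] =
  List(Authorizable(Server, "server1"), Authorizable(Db, "testdb"))

// The required output privileges, keyed by AuthorizableType; Sentry's
// CREATETABLE mapping requires a create-style action at Db scope.
val requiredOutputPrivileges: Map[AuthorizableType, Set[String]] =
  Map(Db -> Set("CREATE"))

// authorize pairs the Db key with outputHierarchy (whose authzType is Db)
// and then calls authProvider.hasAccess for that pairing.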
The actual privilege check lives in the doHasAccess method of ResourceAuthorizationProvider:
private boolean doHasAccess(Subject subject,
    List<? extends Authorizable> authorizables, Set<? extends Action> actions,
    ActiveRoleSet roleSet) {
  // Build the privilege strings this request needs, then test each one
  // against the privileges granted to the user's groups and roles.
  List<String> requestPrivileges = buildPermissions(authorizables, actions);
  try {
    Set<String> groups = getGroups(subject);
    Set<String> users = Sets.newHashSet(subject.getName());
    Set<String> hierarchy = new HashSet<String>();
    for (Authorizable authorizable : authorizables) {
      hierarchy.add(KV_JOINER.join(authorizable.getTypeName(), authorizable.getName()));
    }
    LOGGER.info("get privileges args, groups: {}, users: {}, role set: {}, authorizables: {}",
        new Object[]{groups, users, roleSet, authorizables.toArray(new Authorizable[0])});
    Iterable<Privilege> privileges = getPrivileges(groups, users, roleSet,
        authorizables.toArray(new Authorizable[0]));
    lastFailedPrivileges.get().clear();
    for (String requestPrivilege : requestPrivileges) {
      Privilege priv = privilegeFactory.createPrivilege(requestPrivilege);
      for (Privilege permission : privileges) {
        boolean result = permission.implies(priv, model);
        LOGGER.info("user: {}, group: {}, ProviderPrivilege {}, RequestPrivilege {}, RoleSet, {}, Result {}",
            new Object[]{users, groups, permission, requestPrivilege, roleSet, result});
        if (result) {
          return true;
        }
      }
    }
  } catch (Exception ex) {
    LOGGER.error("sentry auth privilege error: {}", Throwables.getStackTraceAsString(ex));
  }
  lastFailedPrivileges.get().addAll(requestPrivileges);
  return false;
}
Sentry reads the privileges the user holds, by group and by role, from its database and compares them with the required privileges; authorization passes only when every required privilege on inputHierarchyList is matched.
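To give a rough picture of the comparison, privileges travel as key=value chains joined by "->". The matcher below is a deliberately simplified model of implies, not Sentry's real implementation (which also handles wildcards and action semantics):
// Simplified model: a granted privilege implies a requested one when each
// of its parts matches the corresponding prefix part of the request.
def implies(granted: String, requested: String): Boolean = {
  val g = granted.split("->")
  val r = requested.split("->")
  g.length <= r.length && g.zip(r).forall { case (gp, rp) =>
    gp.equalsIgnoreCase(rp)
  }
}

// A db-level grant covers objects under that database...
implies("server=server1->db=testdb",
  "server=server1->db=testdb->table=t1")                      // true
// ...but not a URI request such as the table's intended HDFS location:
implies("server=server1->db=testdb",
  "server=server1->uri=hdfs://nn/warehouse/testdb.db/t1")     // false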
Comparing the Sentry logs from a Spark SQL run against those from a beeline run shows that the inputHierarchyList passed by Spark SQL includes the location of the table about to be created. At that point the table does not yet exist and the user only holds privileges on the database, so the privilege check on the table's location cannot pass.
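In terms of the simplified model above, the difference between the two runs looks roughly like this (the HDFS path is hypothetical):
// Input hierarchies seen in the beeline run: database-scoped objects only.
val beelineInput: List[Authorizable] =
  List(Authorizable(Server, "server1"), Authorizable(Db, "testdb"))

// Input hierarchies seen in the Spark SQL run: the target location shows up
// as a URI authorizable, and the user holds no URI grant for it yet.
val sparkSqlInput: List[Authorizable] =
  List(Authorizable(Server, "server1"),
    Authorizable(URI, "hdfs://nn/warehouse/testdb.db/t1"))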
Reading the Spark SQL source, Spark's table-creation logic lives in the createTable method of HiveClientImpl:
override def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit = withHiveState {
  verifyColumnDataType(table.dataSchema)
  client.createTable(toHiveTable(table, Some(userName)), ignoreIfExists)
}
A conversion happens here: the CatalogTable, which carries the location information, is turned into a Hive Table.
We modify this logic so that the location is removed from the table during createTable:
override def createTable(table: CatalogTable, ignoreIfExists: Boolean): Unit = withHiveState {
  verifyColumnDataType(table.dataSchema)
  val hiveTable = toHiveTable(table, Some(userName))
  if (sparkConf.getBoolean("spark.sql.enable.sentry", defaultValue = false)) {
    // Drop the location so Sentry never sees a URI for the not-yet-created
    // table; the metastore then assigns the default location under the database.
    hiveTable.getTTable.getSd.setLocation(null)
  }
  client.createTable(hiveTable, ignoreIfExists)
}
When Spark SQL is run with spark.sql.enable.sentry set to true, the Sentry privilege check now passes.
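With the patched build, the switch can be set when the session is created, for example (a sketch mirroring the reproduction at the beginning):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sentry-createtable")
  .config("spark.sql.enable.sentry", "true")  // strip location in createTable
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE testdb.t1 (id INT, name STRING)")  // now passes Sentry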