Data science in Oil and Gas
Facies are uniform sedimentary bodies of rock which are distinguishable enough from each other in terms of physical characteristics (e.g. sedimentary structure, grain sizes), deposited under the action of a relatively uniform hydrodynamic regime in a given depositional setting. The different types of facies based on these properties include sedimentary facies, lithofacies, seismic facies, etc.
The physical and organic characteristics found in these rock units usually provide some insight into the different processes and systems (e.g. depositional environments) which may have occurred in the region. Combining several facies with physical models and other geological data can help provide informative low-dimensional models of the region, leading to better insights into its geology.
In oil and gas exploration, knowledge of the depositional environment is vital because it helps build a picture of the petroleum systems involved. A petroleum system consists of source rocks (rocks rich in organic matter that generate hydrocarbons if heated sufficiently), reservoir rocks (rock units that hold hydrocarbons that have migrated from the source rock), and seal rocks (relatively impermeable rock units that form a barrier around reservoir rocks, preventing hydrocarbon migration beyond the reservoir). A succession of facies with sandstone units, for instance, might indicate a good reservoir, as sandstones tend to have high permeability and porosity, ideal conditions for storing hydrocarbons. A common rule known as Walther’s Law of facies states that a vertical succession of facies reflects lateral changes in the depositional environment, i.e. facies that lie conformably on top of one another in the rock record were deposited in laterally adjacent environments. Therefore, the amount of hydrocarbon in place may be estimated by observing the lateral extent and geometries of the facies containing the reservoir units.
Data on sub-surface rocks come from a variety of sources, most of which involve drilling. The ideal source of data for facies classification is core (rock) samples obtained from drilled wells, as they allow for direct assessment of the sedimentary structure. However, cores are expensive to obtain and are not always a feasible option. In these situations, indirect measurements are necessary. In this study, I will focus on wireline logs. Wireline logging involves lowering instruments into a borehole and recording measurements that describe the physical characteristics of the surrounding rock and fluids with depth. It is a means of assessing reservoirs, since it helps differentiate between oil-, gas-, and water-bearing formations and determine porosity as well as the approximate amount of hydrocarbons present in each formation.
In order to assign facies using log data, an analyst is required to analyze the records and determine the approximate lithology with depth. With larger datasets, this can become quite tedious and very inefficient in terms of time. Automation of this procedure is therefore necessary and may be helpful in giving a quick and reasonably accurate picture of the regional geology. I will demonstrate how unsupervised learning can help with this process using K-means clustering.
Study area
The analysis is focused on the northeastern region of Kansas, USA, specifically the Forest City Basin. This basin has predominantly been a shallow oil and gas province with some coal production.
My focus was on the Nemaha region within the Forest City Basin, which is known to have occurrences of oil and gas and has had a couple of wells drilled there. The data was obtained courtesy of the Kansas Geological Survey, which keeps public records for different wells in the region (see https://www.shalexp.com/kinney-oil-company for more details on the wells).
The well logs were obtained in LAS file format and were examined for data quality and completeness. A well was selected only if it contained the eight curves deemed necessary for the analysis. The table below shows these curves and the properties of the rock formation that they describe. All values are recorded as a function of depth, measured in feet.
Using a wide array of logs is useful: they not only provide information on properties such as the lithology, porosity, and electrical conductivity of a formation but, used in conjunction with each other, can also improve the accuracy of lithology estimation in situations where an individual log might fail. I examined several wells and found 6 with all the required curves. The curves from these wells were then combined into a single CSV file for processing. The well names and the formation at each depth were also included. The formation information was obtained from http://www.kgs.ku.edu/Magellan/Qualified/index.html.
(Figure: Oil wells belonging to Kinney Oil)
Data pre-processing and cleaning
Descriptive statistics were obtained on the raw data to determine the presence of missing values and potential outliers. Outlier removal is essential as their presence can increase variability in data, making the results of an experiment less likely to be statistically significant.
The counts differed across the well logs, which indicates missing values for some logs. In addition, some well logs which contained no data held a placeholder value of 999 instead. Normally, missing values can be addressed by imputing the mean or median of the available values. However, geology can be quite complex, and simply replacing missing values with the mean may not reflect the actual geology. The missing values accounted for a small fraction of the total data, so they were dropped instead. The Unnamed column was also dropped, as it was simply the index of the data measurements.
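The cleaning step described above can be sketched in pandas as follows. This is only a minimal illustration: the tiny table, its values, and the column names (including "Unnamed: 0") are assumptions for demonstration, not the author's dataset or code.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined well-log table; column names and
# values are illustrative only.
logs = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "DEPT": [1000.0, 1000.5, 1001.0, 1001.5],
    "GR": [45.2, 999.0, 60.1, 58.7],
    "ILD": [12.0, 15.5, np.nan, 20.3],
})

# Drop the leftover index column (assumed here to be named "Unnamed: 0").
logs = logs.drop(columns=["Unnamed: 0"])

# Treat the 999 placeholder as missing in the curve columns only, so that a
# genuine depth of 999 ft would not be discarded by mistake.
curve_cols = ["GR", "ILD"]
logs[curve_cols] = logs[curve_cols].replace(999.0, np.nan)

# Drop every row that still has a missing curve value.
cleaned = logs.dropna().reset_index(drop=True)
print(cleaned)
```

Restricting the placeholder replacement to the curve columns is a design choice of this sketch; the article does not say how the 999 values were located.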
Removing outliers
The missing data have now been removed, but if we look at the maximum values for each well log, we notice some cases where the maximum is significantly greater than the mean, which is a red flag for outliers. Let’s visually inspect some of these data.
We can see some values that are significantly greater than the general population for some of the logs, especially the resistivity log (ILD). Note that values as high as 10,000 Ω·m for ILD do occur in nature for certain lithologies. However, resistivity values for most rock types span a range, and lithologies that can reach 10,000 Ω·m also cover values around 2,000 Ω·m, so a cut-off near 2,000 Ω·m does not exclude them entirely. Judging by the relatively low prevalence of such high values, removing them may improve clustering results. Using the ILD curve as a reference, I found that keeping values at or below the 0.9995 quantile (the 99.95th percentile) resulted in a cut-off of around 2,000 Ω·m for the ILD. This threshold was gentle enough to minimize data loss.
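A quantile-based cut-off of this kind might look like the sketch below. The resistivity values are synthetic (a lognormal sample with three injected spikes) and the variable names are hypothetical; only the 0.9995 quantile comes from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic resistivity readings, for illustration only: mostly moderate
# values plus a few extreme spikes standing in for outliers.
ild = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=10_000), name="ILD")
ild.iloc[:3] = [8000.0, 9500.0, 10000.0]

# Keep everything at or below the 0.9995 quantile (99.95th percentile).
cutoff = ild.quantile(0.9995)
trimmed = ild[ild <= cutoff]

print("cut-off value:", cutoff)
print("rows removed:", len(ild) - len(trimmed))
```

In a multi-curve table the same boolean mask would be applied to the whole DataFrame, so that each removed depth sample is dropped from every log at once.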
As some lithologies may produce big spikes in certain physical measurements (e.g. ILD, CILD), a natural-log transform was used to reduce the variability in measurement values. It was applied to the ILD and CILD logs because of the large range in their values. Finally, the processed logs were saved to a CSV file.
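As a rough sketch of the transform, assuming strictly positive readings and the usual reciprocal relation between conductivity in mS/m and resistivity in Ω·m (CILD = 1000 / ILD, used here only to generate example values):

```python
import numpy as np
import pandas as pd

# Illustrative resistivity values spanning three orders of magnitude.
logs = pd.DataFrame({"ILD": [2.0, 20.0, 200.0, 2000.0]})

# Assumed relation for the example: conductivity (mS/m) = 1000 / resistivity (ohm-m).
logs["CILD"] = 1000.0 / logs["ILD"]

# The natural log compresses the range; it requires strictly positive inputs.
logs["ILD_log"] = np.log(logs["ILD"])
logs["CILD_log"] = np.log(logs["CILD"])

raw_range = logs["ILD"].max() - logs["ILD"].min()
log_range = logs["ILD_log"].max() - logs["ILD_log"].min()
print(raw_range, log_range)
```

Under this relation the two transformed curves always sum to ln(1000), so after the transform they carry the same information with opposite sign.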
K-means Clustering
After pre-processing the data, it is important to do some feature selection to ensure faster training, a less complex model, and improved accuracy. In our data set, we have two logs that give very similar information: the resistivity (ILD) and conductivity (CILD) logs. We can verify this by checking the correlation between the logs to see how strongly related they are.
features.corr() # compute correlation between the well log curves
While there are some high correlation values, for this study only correlations of 0.8 and above were treated as strong enough to indicate redundant features. I picked a high threshold because some logs do measure very similar properties in terms of lithology, but they also have important differences that can help tell lithologies apart. The logarithmic resistivity and conductivity values have a near-perfect correlation, i.e. they give us basically the same information. The conductivity was therefore dropped from the features list.
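One way to apply such a threshold programmatically is sketched below. The feature table is synthetic and the column names are hypothetical; the 0.8 threshold is the one quoted above, and comparing absolute correlations is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in for the feature table; names and values are illustrative.
ild_log = rng.normal(3.0, 1.0, n)
features = pd.DataFrame({
    "GR": rng.normal(60.0, 20.0, n),
    "ILD_log": ild_log,
    "CILD_log": np.log(1000.0) - ild_log,  # exact inverse of ILD_log
})

# Absolute correlations, keeping only the upper triangle so each pair of
# curves is examined once and the diagonal is ignored.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose correlation reaches the threshold.
to_drop = [col for col in upper.columns if (upper[col] >= 0.8).any()]
reduced = features.drop(columns=to_drop)
print(to_drop)
```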
Next, the data were scaled using the scale function from the sklearn package. The issue with k-means clustering is that we do not know in advance which number of clusters best represents our data. I therefore employed two techniques that can give us an idea of what the optimal number of clusters could be.
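A minimal sketch of the scaling step is shown below, with made-up numbers standing in for two curves on very different scales. K-means relies on Euclidean distance, so without scaling the curve with the larger numeric range would dominate the clustering.

```python
import numpy as np
from sklearn.preprocessing import scale

# Illustrative feature matrix: column 0 resembles gamma-ray readings,
# column 1 resembles bulk density. Values are invented for the example.
x = np.array([
    [20.0, 2.30],
    [80.0, 2.65],
    [140.0, 2.71],
    [60.0, 2.50],
])

# Standardize each column to zero mean and unit variance.
x_scaled_demo = scale(x)
print(x_scaled_demo.mean(axis=0), x_scaled_demo.std(axis=0))
```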
Let’s begin with the Elbow technique. This method runs K-means clustering for a range of cluster counts and calculates the within-cluster sum of squares (WCSS) for each, i.e. the sum of squared distances between each point and its assigned cluster centroid. The optimal number of clusters is usually taken as the point beyond which adding more clusters yields only small further reductions in the WCSS. Let’s run the analysis for 1 to 11 clusters (the loop below stops before cl_num = 12):
from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares for each number of clusters
cl_num = 12  # exclusive upper bound: cluster counts 1 to 11 are tested
for i in range(1, cl_num):
    kmeans = KMeans(n_clusters=i, random_state=10)
    kmeans.fit(x_scaled)
    wcss_iter = kmeans.inertia_  # WCSS for this number of clusters
    wcss.append(wcss_iter)
wcss
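To make the WCSS definition concrete, the sketch below recomputes it by hand on a small synthetic dataset and compares it with the inertia_ attribute of a fitted KMeans model. The data and the names x_demo and km are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)

# Two well-separated synthetic blobs (illustrative data, not well logs).
x_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=10).fit(x_demo)

# WCSS by hand: squared distance from each point to its assigned centroid,
# summed over all points.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_wcss = np.sum((x_demo - assigned_centroids) ** 2)

print(manual_wcss, km.inertia_)
```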
Next, we plot these values as a function of the number of clusters used
import matplotlib.pyplot as plt

number_clusters = range(1, cl_num)  # same cluster counts used in the loop above
plt.figure(figsize=(10, 8))
plt.plot(number_clusters, wcss, '*-')
plt.xlabel('Number of clusters', fontsize=20)
plt.ylabel('Within-cluster Sum of Squares', fontsize=20)
plt.show()
Although it is not very apparent, we can see the curve begin to flatten somewhat at around 5 clusters. We can further evaluate the optimal number of clusters by looking at the silhouette score. The silhouette score is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Silhouette coefficients near 1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. Let’s calculate the silhouette scores for different numbers of clusters, starting at 5.
from sklearn.metrics import silhouette_score

range_n_clusters = [5, 6, 7, 8, 9]  # numbers of clusters to evaluate
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(x_scaled)
    # mean silhouette coefficient over all samples
    silhouette_avg = silhouette_score(x_scaled, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
The silhouette score was highest at 8 clusters. A value of 6 gave a very similar score, but a finer separation is preferable for facies identification, because some lithologies tend to be interbedded with each other and additional clusters might reveal these patterns.
Using the KMeans algorithm from sklearn’s cluster module, the well log data were clustered into 8 groups. I used 100 random initial centroid seeds to increase the probability of finding a cluster separation that best describes the data.
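The final clustering could be set up roughly as below. This is a sketch under assumptions: the article does not show this call, so the use of n_init=100 to express "100 random initial centroid seeds", the random_state value, the 1-based facies numbering, and the random stand-in data are all guesses for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)

# Random stand-in for the scaled well-log features (400 samples, 6 curves).
x_scaled_demo = rng.normal(size=(400, 6))

# n_init=100 reruns k-means from 100 different centroid initializations and
# keeps the run with the lowest within-cluster sum of squares.
kmeans = KMeans(n_clusters=8, n_init=100, random_state=10)

# Shift the 0-based cluster labels to 1..8 so they read as facies numbers
# (an assumed convention).
facies = kmeans.fit_predict(x_scaled_demo) + 1
print(np.unique(facies))
```

Keeping the best of many initializations reduces, but does not remove, the dependence of the result on the starting centroids.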
Data visualization and interpretation
The well logs and facies as a function of depth were plotted for some of the wells using code modified from Brendan Hall.
Validating results
The plots show lateral variation in lithology across this region, since the facies differ between wells at around the same depth. We can validate these results with some knowledge of the Kansas stratigraphy. According to KGS geology, the dominant lithologies in this region are sandstones, shales, dolomite, and limestones, with some silt, chert, and chalk. Analysis of core samples also suggests the presence of gypsum and pyrite, amongst others. My analysis highlighted 8 facies, which is a reasonable result given the known information. Different rocks have different well log responses, and these responses used in combination can give us a pretty good idea of what the underlying lithology is. Different log responses for different lithologies can be found here and here.
If we consider the Viola formation, for instance, in the Woody Acres well it is represented predominantly by Facies 2. It is characterized by extremely low gamma-ray values, density values between 2.5 and 2.8 g/cm³, and DT values on the low end (47–67 µs/ft). The PE values show a jump halfway through the formation, suggesting the presence of another lithology, but the overall properties suggest a dolomitic layer with some limestone inclusions. In the Baumgartner well, the PE values in the bottom half of the formation jump towards 5 b/e, which is indicative of limestone. The top half shows similar properties to dolomite in terms of the PE value, and its decreasing neutron porosity coupled with a relatively stable density suggests this facies is a combination of dolomite and limestone, i.e. a dolomitic limestone layer. We can, therefore, interpret Facies 7 and 5 as limestone and dolomitic limestone respectively.
The Kinderhook formation in the Hartter well shows a jump in the gamma readings, an increase in neutron porosity, with relatively stable density readings and a much lower resistivity reading. The sudden increase in the sonic porosity in addition to the previously mentioned properties points to a shale layer for Facies 4. The Viola formation in this well also has very similar properties, but the sonic porosity values are lower, and the slightly higher resistivity values compared to the Kinderhook formation suggest the inclusion of sands, limestone, or dolomites. However, the PE values are lower than that of dolomites or limestones suggesting Facies 1 is a sandy shale layer. Finally, the low gamma values in the top parts of the Hunton formation, coupled with the drop in the neutron porosity and increase in density and PE values in the dolomitic range suggest a dolomitic sand regime for Facies 3.
Reference to KGS geology reveals the Viola formation to be composed of fine- to coarse-grained limestones and dolomites containing variable quantities of chert. It also describes the Hunton formation as generally composed of gray to brown, fine-grained, crystalline dolomite or limestone with minor chert in some parts, and as slightly coarser-grained, slightly sandy dolomite with vuggy porosity and some chert in other parts. These descriptions agree with my analysis, which shows that limestones and dolomites constitute the majority lithology in these formations.
Summary
Unsupervised machine learning can be a great way to derive some quick insights from your data in a cost-effective manner and could be a guide for additional supervised work as demonstrated in this study. However, as attractive as the prospects of this technique are, it is important to be aware of the caveats associated with this method.
The optimal number of clusters is not easy to predict and can be subjective in most cases. Most lithologies are not completely homogeneous and tend to include different rock types, so finding an optimal k value is difficult, sometimes impossible, and may require some a priori knowledge. Additionally, the clusters can change depending on the initial centroid locations. The method is also not a practical solution for situations where there is an enormous amount of data. In those cases, it is preferable to use supervised learning techniques. In future work, I will demonstrate how a convolutional neural network can be applied to facies classification and evaluate its effectiveness in terms of accuracy and other metrics.
Comments and feedback are welcome. The codes, as well as the dataset, will be provided here in the not so distant future, and my Linkedin profile can be found here. I hope you found the article interesting, this was a fun project to investigate!
Ibinabo Bestmann
Translated from: https://towardsdatascience.com/facies-classification-using-unsupervised-machine-learning-in-geoscience-8b33f882a4bf