01_kaggle - Using a 3D Convolutional Neural Network on medical imaging data (CT Scans) for Kaggle

Kaggle Competition

Applying a 3D convolutional neural network to the data

Welcome everyone to my coverage of the Kaggle Data Science Bowl 2017. My goal here is that anyone, even people new to kaggle, can follow along. If you are completely new to data science, I will do my best to link to tutorials and provide information on everything you need to take part.

This notebook is my actual personal initial run through this data and my notes along the way. I am by no means an expert data analyst, statistician, and certainly not a doctor. This initial pass is not going to win the competition, but hopefully it can serve as a starting point or, at the very least, you can learn something new along with me.

This is a "raw" look into the actual code I used on my first pass, there's a ton of room for improvment. If you see something that you could improve, share it with me!

Quick introduction to Kaggle

If you are new to kaggle, create an account, and start downloading the data. It's going to take a while. I found the torrent to download the fastest, so I'd suggest you go that route. When you create an account, head to competitions in the nav bar, choose the Data Science Bowl, then head to the "data" tab. You will need to accept the terms of the competition to proceed with downloading the data.

Just in case you are new, how does all this work?

In general, Kaggle competitions come with training and testing data for you to build a model on, where both the training and testing data come with labels so you can fit a model. Then there is the actual "blind" or "out of sample" testing data that you run your model on, producing an output CSV file with your predictions based on that input data. This is what you upload to Kaggle, and your score here is what you compete with. There's always a sample submission file in the dataset, so you can see exactly how to format your output predictions.

In this case, the submission file should have two columns, one for the patient's id and another for the predicted likelihood that this patient has cancer, like:

id,cancer
01e349d34c02410e1da273add27be25c,0.5
05a20caf6ab6df4643644c923f06a5eb,0.5
0d12f1c627df49eb223771c28548350e,0.5
...
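
If it helps to see that in code, here's a minimal sketch of writing such a file with pandas (the predictions dict is made up for illustration; any mapping of patient id to probability would do):

import pandas as pd

# hypothetical predictions: {patient_id: probability of cancer}
predictions = {'01e349d34c02410e1da273add27be25c': 0.5,
               '05a20caf6ab6df4643644c923f06a5eb': 0.5}

submission = pd.DataFrame(list(predictions.items()), columns=['id', 'cancer'])
submission.to_csv('submission.csv', index=False)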

You can submit up to 3 entries a day, so you want to be very happy with your model, and you are at least slightly disincentivised from trying to simply fit the answer key over time. It's still possible to cheat, but if you do cheat, you won't win anything, since you will have to disclose your model for any prizes.

At the end, you can submit 2 final submissions (allowing you to compete with 2 models if you like).

This current competition is a 2 stage competition, where you have to participate in both stages to win. Stage one has you competing based on a validation dataset. At the release of stage 2, the validation set answers are released and then you make predictions on a new test set that comes out at the release of this second stage.

About this specific competition

At its core, the aim here is to take the sample data, consisting of low-dose CT scan information, and predict the likelihood that a patient has lung cancer. Your submission is scored based on the log loss of your predictions.
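
For reference, log loss heavily punishes predictions that are both confident and wrong, which is why the sample submission above hedges at 0.5. A rough sketch of the metric (not the official scoring code, just the standard binary log loss with clipping to avoid log(0)):

import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # clip so we never take log(0); confident-and-wrong predictions cost the most
    y_pred = np.clip(np.array(y_pred, dtype=np.float64), eps, 1 - eps)
    y_true = np.array(y_true, dtype=np.float64)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# log_loss([1, 0], [0.9, 0.2]) is about 0.164; log_loss([1, 0], [0.01, 0.99]) is about 4.6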

The dataset is pretty large at ~140GB just in initial training data, so this can be somewhat restrictive right out of the gate. I am going to do my best to make this tutorial one that anyone can follow within the built-in Kaggle kernels.

Requirements and suggestions for following along

I will be using Python 3, and you should at least know the basics of Python 3.

We will also be making use of: 

Pandas for some data analysis
Matplotlib for data visualization
scikit-learn and tensorflow for machine learning and modeling.

You do not need to go through all of those tutorials to follow here, but, if you are confused, it might be useful to poke around those.

For the actual dependency installs and such, I will link to them as we go.

Alright, let's get started!

Section 1: Handling Data

Assuming you've downloaded the data, what exactly are we working with here? The data consists of many 2D "slices," which, when combined, produce a 3-dimensional rendering of whatever was scanned. In this case, that's the chest cavity of the patient. We've got CT scans of about 1500 patients, and then we've got another file that contains the labels for this data.

There are numerous ways that we could go about creating a classifier. Being a realistic data science problem, we actually don't really know what the best path is going to be. That's why this is a competition. Thus, we have to begin by simply trying things and seeing what happens!

I have a few theories about what might work, but my first interest was to try a 3D Convolutional Neural Network. I've never had data to try one on before, so I was excited to try my hand at it!

Before we can feed the data through any model, however, we need to at least understand the data we're working with. We know the scans are in this "dicom" format, but what is that? If you're like me, you have no idea what that is, or how it will look in Python! You can learn more about DICOM from Wikipedia if you like, but our main focus is what this will actually be in Python terms.

Luckily for us, there already exists a Python package for reading dicom files: Pydicom.

Do a pip install pydicom and pip install pandas and let's see what we've got!

import dicom # for reading dicom files
import os # for doing directory operations 
import pandas as pd # for some simple data analysis (right now, just to load in the labels data and quickly reference it)

# Change this to wherever you are storing your data:
# IF YOU ARE FOLLOWING ON KAGGLE, YOU CAN ONLY PLAY WITH THE SAMPLE DATA, WHICH IS MUCH SMALLER
data_dir = 'X:/Kaggle_Data/datasciencebowl2017/stage1/'
patients = os.listdir(data_dir)
labels_df = pd.read_csv('stage1_labels.csv', index_col=0)

labels_df.head()
 

                                  cancer
id
0015ceb851d7251b8f399e39779d1e7d       1
0030a160d58723ff36d73f41b170ec21       0
003f41c78e6acfa92430a057ac0b306e       0
006b96310a37b36cccb2ab48d10b49a3       1
008464bb8521d09a42985dd8add3d0d2       1

At this point, we've got the list of patients by their IDs, and their associated labels stored in a dataframe. Now, we can begin to iterate through the patients and gather their respective data. We're almost certainly going to need to do some preprocessing of this data, but we'll see.

for patient in patients[:1]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    
    # a couple great 1-liners from: https://www.kaggle.com/gzuidhof/data-science-bowl-2017/full-preprocessing-tutorial
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    print(len(slices),label)
    print(slices[0])
195 1
(0008, 0000) Group Length                        UL: 358
(0008, 0005) Specific Character Set              CS: 'ISO_IR 100'
(0008, 0016) SOP Class UID                       UI: CT Image Storage
(0008, 0018) SOP Instance UID                    UI: 1.3.6.1.4.1.14519.5.2.1.7009.9004.321555830121981826540353244716
(0008, 0060) Modality                            CS: 'CT'
(0008, 103e) Series Description                  LO: 'Axial'
(0010, 0000) Group Length                        UL: 64
(0010, 0010) Patient's Name                      PN: '0015ceb851d7251b8f399e39779d1e7d'
(0010, 0020) Patient ID                          LO: '0015ceb851d7251b8f399e39779d1e7d'
(0010, 0030) Patient's Birth Date                DA: '19000101'
(0018, 0060) KVP                                 DS: ''
(0020, 0000) Group Length                        UL: 390
(0020, 000d) Study Instance UID                  UI: 2.25.51820907428519808061667399603379702974102486079290552633235
(0020, 000e) Series Instance UID                 UI: 2.25.80077681474898947928901837852399466259680906162240840687333
(0020, 0011) Series Number                       IS: '2'
(0020, 0012) Acquisition Number                  IS: '1'
(0020, 0013) Instance Number                     IS: '1'
(0020, 0032) Image Position (Patient)            DS: ['-177.500000', '-177.500000', '-384.109985']
(0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '1.000000', '0.000000']
(0020, 0052) Frame of Reference UID              UI: 2.25.39085688508687805643860350141257111202569461626559365918180
(0020, 1040) Position Reference Indicator        LO: 'SN'
(0020, 1041) Slice Location                      DS: '-384.109985'
(0028, 0000) Group Length                        UL: 200
(0028, 0002) Samples per Pixel                   US: 1
(0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
(0028, 0010) Rows                                US: 512
(0028, 0011) Columns                             US: 512
(0028, 0030) Pixel Spacing                       DS: ['0.693359', '0.693359']
(0028, 0100) Bits Allocated                      US: 16
(0028, 0101) Bits Stored                         US: 16
(0028, 0102) High Bit                            US: 15
(0028, 0103) Pixel Representation                US: 1
(0028, 0120) Pixel Padding Value                 US or SS: b'0\xf8'
(0028, 0301) Burned In Annotation                CS: 'NO'
(0028, 0303) Longitudinal Temporal Information M CS: 'MODIFIED'
(0028, 1050) Window Center                       DS: '-600'
(0028, 1051) Window Width                        DS: '1500'
(0028, 1052) Rescale Intercept                   DS: '-1024'
(0028, 1053) Rescale Slope                       DS: '1'
(7fe0, 0010) Pixel Data                          OB or OW: Array of 524288 bytes

Above, we iterate through each patient, grab their label, and build the full path to that patient's directory (inside THAT path are ~200ish scans, which we also iterate over, BUT also want to sort, since they won't necessarily be in the proper order).

Do note here that the actual scan, when loaded by dicom, is clearly not JUST some sort of array of values; instead it's got attributes, a few of which are arrays, but not all of them. We're sorting by the actual image position in the scan. Later, we could actually put these together to get a full 3D rendering of the scan. That's not in my plans here, since that's already been covered very well; see this kernel: https://www.kaggle.com/gzuidhof/data-science-bowl-2017/full-preprocessing-tutorial
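
Just to make that concrete, here's a minimal sketch of stacking the sorted slices into a single 3D numpy array (we won't need this for the rest of the tutorial; the linked kernel covers proper 3D preprocessing):

import numpy as np

# slices is the sorted list of pydicom datasets from the loop above
volume = np.stack([s.pixel_array for s in slices])
print(volume.shape)  # e.g. (195, 512, 512) for this first patient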

One immediate thing to note here is those rows and columns...holy moly, 512 x 512! This means our 3D rendering is 195 x 512 x 512 right now. That's huge!

Alright, so we already know that we're going to absolutely need to resize this data. Since this one is 512 x 512, I am already expecting all of this data to be the same size, but let's see what we get from other patients too:

for patient in patients[:10]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    
    # a couple great 1-liners from: https://www.kaggle.com/gzuidhof/data-science-bowl-2017/full-preprocessing-tutorial
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    print(slices[0].pixel_array.shape, len(slices))
(512, 512) 195
(512, 512) 265
(512, 512) 233
(512, 512) 173
(512, 512) 146
(512, 512) 171
(512, 512) 123
(512, 512) 134
(512, 512) 135
(512, 512) 191

Alright, so above we just went ahead and grabbed the pixel_array attribute, which is what I assume to be the scan slice itself (we will confirm this soon), but immediately I am surprised by the non-uniformity in the number of slices. This isn't quite ideal and will cause a problem later. All of our images are the same size, but the slice counts aren't, so in terms of a 3D rendering, these scans are actually not the same size.

We've got to actually figure out a way to solve that uniformity problem, but also...these images are just WAY too big for a convolutional neural network to handle without some serious computing power.

Thus, we already know out of the gate that we're going to need to downsample this data quite a bit, AND somehow make the depth uniform.
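
To put some rough numbers on that (back-of-the-envelope, using sizes from this dataset and the target sizes we end up using later):

# one raw patient volume vs. a downsampled 20 x 50 x 50 version
raw_values = 195 * 512 * 512      # about 51 million values for a single patient
small_values = 20 * 50 * 50       # 50,000 values
print(raw_values, small_values, raw_values // small_values)  # roughly a 1000x reduction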

Welcome to data science!

Okay, next question is...just how much data do we have here?

len(patients)
1595

Oh.

Well, that's also going to be a challenge for the convnet to figure out, but we're going to try! Also, there are outside data sources for more lung scans. For example, you can grab data from the LUNA16 challenge: https://luna16.grand-challenge.org/data/ for another 888 scans.

Do note that, if you do wish to compete, you can only use free datasets that are available to anyone who bothers to look.

I'll have us stick to just the base dataset, again mainly so anyone can poke around this code in the kernel environment.

Now, let's see what an actual slice looks like. If you do not have matplotlib, do a pip install matplotlib.

Want to learn more about Matplotlib? Check out the Data Visualization with Python and Matplotlib tutorial.

import matplotlib.pyplot as plt

for patient in patients[:1]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    
    #          the first slice
    plt.imshow(slices[0].pixel_array)
    plt.show()

[Image: the first CT scan slice, plotted with matplotlib]

Now, I am not a doctor, but I'm going to claim a mini-victory and say that's our first CT scan slice.

We have about 200 slices though; I'd feel more comfortable if I saw a few more. Let's look at the first 12, and resize them with OpenCV. If you do not have OpenCV, do a pip install opencv-python (it is imported as cv2).

Want to learn more about what you can do with Open CV? Check out the Image analysis and manipulation with OpenCV and Python tutorial.

You will also need numpy here. You probably already have numpy if you installed pandas, but, just in case, numpy is pip install numpy

Section 2: Processing and viewing our Data

import cv2
import numpy as np

IMG_PX_SIZE = 150

for patient in patients[:1]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    fig = plt.figure()
    for num,each_slice in enumerate(slices[:12]):
        y = fig.add_subplot(3,4,num+1)
        new_img = cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE))
        y.imshow(new_img)
    plt.show()

[Image: the first 12 slices, resized to 150x150]

Alright, so we're resizing our images from 512x512 to 150x150. 150 is still likely going to wind up being way too big, but that's fine; we can play with that constant more later. We just want to know how to do it.

Okay, so now what? I think we need to address the whole non-uniformity of depth next. To be honest, I don't know of any super smooth way of doing this, but that's fine. I can at least think of A way, and that's all we need.

My thought is that what we have is really a big list of slices. What we need is to be able to take any list of images, whether it's got 200 scans, 150 scans, or 300 scans, and reduce it to some fixed number.

Let's say we want to have 20 scans instead. How can we do this?

Well, first, we need something that will take our current list of scans, and chunk it into a list of lists of scans.

I couldn't think of anything off the top of my head for this, so I Googled "how to chunk a list into a list of lists." This is how real programming happens.

As per Ned Batchelder via Link: http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks, we've got ourselves a nice chunker generator. Awesome!

Thanks Ned!

Okay, once we've got these chunks of scans, what are we going to do? Well, we can just average them together. My theory is that a single slice covers a few millimeters of actual tissue at most. Thus, we can hopefully average each chunk of slices together, and maybe we're now working with a centimeter or so per slice. If there's a growth there, it should still show up on the scan.

This is just a theory, it has to be tested.

As we continue through this, however, you're hopefully going to see just how many theories we come up with, and how many variables we can tweak and change to possibly get better results.

import math

def chunks(l, n):
    # Credit: Ned Batchelder
    # Link: http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def mean(l):
    return sum(l) / len(l)

IMG_PX_SIZE = 150
HM_SLICES = 20

data_dir = 'X:/Kaggle_Data/datasciencebowl2017/sample_images/'
patients = os.listdir(data_dir)
labels_df = pd.read_csv('stage1_labels.csv', index_col=0)

for patient in patients[:5]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    new_slices = []
    slices = [cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in slices]
    chunk_sizes = math.ceil(len(slices) / HM_SLICES)
    for slice_chunk in chunks(slices, chunk_sizes):
        slice_chunk = list(map(mean, zip(*slice_chunk)))
        new_slices.append(slice_chunk)
        
    print(len(slices), len(new_slices))
195 20
265 19
233 20
173 20
146 19

The struggle is real. Okay, what you're about to see you shouldn't attempt if anyone else is watching, like if you're going to show your code to the public...

for patient in patients[:5]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    new_slices = []

    slices = [cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in slices]
    
    chunk_sizes = math.ceil(len(slices) / HM_SLICES)
    for slice_chunk in chunks(slices, chunk_sizes):
        slice_chunk = list(map(mean, zip(*slice_chunk)))
        new_slices.append(slice_chunk)

    if len(new_slices) == HM_SLICES-1:
        new_slices.append(new_slices[-1])

    if len(new_slices) == HM_SLICES-2:
        new_slices.append(new_slices[-1])
        new_slices.append(new_slices[-1])

    if len(new_slices) == HM_SLICES+2:
        new_val = list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
        del new_slices[HM_SLICES]
        new_slices[HM_SLICES-1] = new_val
        
    if len(new_slices) == HM_SLICES+1:
        new_val = list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
        del new_slices[HM_SLICES]
        new_slices[HM_SLICES-1] = new_val
        
    print(len(slices), len(new_slices))
        
195 20
265 20
233 20
173 20
146 20

Okay, the Python gods are really not happy with me for that hacky solution. If any of you would like to improve this chunking/averaging code, feel free. Really, any of this code...if you have improvements, share them! This is going to stay pretty messy. But hey, we did it! We figured out a way to make sure our 3 dimensional data can be at any resolution we want or need. Awesome!
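
If you'd like one possible cleanup of that hacky bit (just a sketch, not what I used above): numpy's array_split always returns exactly the number of chunks you ask for, so the off-by-one patch-up cases disappear entirely:

import numpy as np

def resize_depth(resized_slices, hm_slices=20):
    # np.array_split returns exactly hm_slices chunks (some one element longer
    # than others), so no HM_SLICES +/- 1 special-casing is needed
    chunks = np.array_split(np.array(resized_slices), hm_slices)
    return np.array([chunk.mean(axis=0) for chunk in chunks])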

That's actually a decently large hurdle. Are we totally done? ...maybe not. One major issue is these colors and ranges of data; it's unclear to me whether or not a model would appreciate that. Even if we use a grayscale colormap in imshow, you'll see that some scans are just darker overall than others. This might be problematic, and we might need to actually normalize this dataset.

I expect that, with a large enough dataset, this wouldn't be an actual issue, but, with this size of data, it might be of huge importance.
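
If you did want to normalize, one common approach (just a sketch, not something this pass does) is to convert the raw pixel values to Hounsfield units using the Rescale Slope/Intercept tags we saw in the DICOM header, clip to a window that covers lung tissue, and scale to 0-1:

import numpy as np

def normalize_slice(dicom_slice, min_hu=-1000, max_hu=400):
    # rescale raw values to (approximate) Hounsfield units, then clip and scale to [0, 1]
    img = dicom_slice.pixel_array.astype(np.float64)
    img = img * float(dicom_slice.RescaleSlope) + float(dicom_slice.RescaleIntercept)
    img = np.clip(img, min_hu, max_hu)
    return (img - min_hu) / (max_hu - min_hu)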

In an effort to not turn this notebook into an actual book, however, we're going to move forward! We can now see our new data by doing:

for patient in patients[:1]:
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))
    new_slices = []

    slices = [cv2.resize(np.array(each_slice.pixel_array),(IMG_PX_SIZE,IMG_PX_SIZE)) for each_slice in slices]
    
    chunk_sizes = math.ceil(len(slices) / HM_SLICES)
    for slice_chunk in chunks(slices, chunk_sizes):
        slice_chunk = list(map(mean, zip(*slice_chunk)))
        new_slices.append(slice_chunk)

    if len(new_slices) == HM_SLICES-1:
        new_slices.append(new_slices[-1])

    if len(new_slices) == HM_SLICES-2:
        new_slices.append(new_slices[-1])
        new_slices.append(new_slices[-1])

    if len(new_slices) == HM_SLICES+2:
        new_val = list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
        del new_slices[HM_SLICES]
        new_slices[HM_SLICES-1] = new_val
        
    if len(new_slices) == HM_SLICES+1:
        new_val = list(map(mean, zip(*[new_slices[HM_SLICES-1],new_slices[HM_SLICES],])))
        del new_slices[HM_SLICES]
        new_slices[HM_SLICES-1] = new_val
    
    fig = plt.figure()
    for num,each_slice in enumerate(new_slices):
        y = fig.add_subplot(4,5,num+1)
        y.imshow(each_slice, cmap='gray')
    plt.show()

[Image: the 20 averaged slices for the first patient, resized to 150x150]

Section 3: Preprocessing our Data

Okay, so we know what we've got, and what we need to do with it.

We have a few options at this point. We could take the code that we already have and do the processing "online": by this, I mean that, while training the network, we can just loop over our patients, resize the data, and then feed it through our neural network. We don't actually have to have all of the data prepared before we go through the network.
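
A rough sketch of what that "online" option could look like: a generator that yields one patient at a time, so you'd do the resizing/chunking on the fly right before each training step (hypothetical, and not the route we take below):

def patient_generator(patients, labels_df):
    # yields (sorted slices, label) one patient at a time; nothing big sits in memory
    for patient in patients:
        try:
            label = labels_df.get_value(patient, 'cancer')
        except KeyError:
            continue  # skip unlabeled patients
        path = data_dir + patient
        slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
        slices.sort(key=lambda x: int(x.ImagePositionPatient[2]))
        yield slices, label  # resize/chunk here before feeding the network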

If you can preprocess all of the data into one file, and that one file doesn't exceed your available memory, then training should likely be faster, so you can more easily tweak your neural network and not be processing your data the same way over and over.

In many more realistic examples in the world, however, your dataset will be so large that you couldn't read it all into memory at once anyway, but you could still maintain one big database or something similar.

Bottom line: there are tons of options here. Our dataset is only ~1,500 patients (even fewer if you are following along in the Kaggle kernel), and each patient will be, for example, 20 slices of 150x150 image data if we go off the numbers we have now, though this will most likely need to be even smaller for a typical computer. Regardless, this much data won't be an issue to keep in memory or do whatever the heck we want with.

If at all possible, I prefer to separate out steps in any big process like this, so I am going to go ahead and preprocess the data, so our neural network code is much simpler. Also, there's no good reason to keep a network sitting in GPU memory while we waste time processing the data, which can easily be done on a CPU.

Now, I will just make a slight modification to all of the code up to this point, and add some new final lines to preprocess this data and save the array of arrays to a file:

import numpy as np
import pandas as pd
import dicom
import os
import matplotlib.pyplot as plt
import cv2
import math

IMG_SIZE_PX = 50
SLICE_COUNT = 20

def chunks(l, n):
    # Credit: Ned Batchelder
    # Link: http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def mean(a):
    return sum(a) / len(a)


def process_data(patient,labels_df,img_px_size=50, hm_slices=20, visualize=False):
    
    label = labels_df.get_value(patient, 'cancer')
    path = data_dir + patient
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))

    new_slices = []
    slices = [cv2.resize(np.array(each_slice.pixel_array),(img_px_size,img_px_size)) for each_slice in slices]
    
    chunk_sizes = math.ceil(len(slices) / hm_slices)
    for slice_chunk in chunks(slices, chunk_sizes):
        slice_chunk = list(map(mean, zip(*slice_chunk)))
        new_slices.append(slice_chunk)

    if len(new_slices) == hm_slices-1:
        new_slices.append(new_slices[-1])

    if len(new_slices) == hm_slices-2:
        new_slices.append(new_slices[-1])
        new_slices.append(new_slices[-1])

    if len(new_slices) == hm_slices+2:
        new_val = list(map(mean, zip(*[new_slices[hm_slices-1],new_slices[hm_slices],])))
        del new_slices[hm_slices]
        new_slices[hm_slices-1] = new_val
        
    if len(new_slices) == hm_slices+1:
        new_val = list(map(mean, zip(*[new_slices[hm_slices-1],new_slices[hm_slices],])))
        del new_slices[hm_slices]
        new_slices[hm_slices-1] = new_val

    if visualize:
        fig = plt.figure()
        for num,each_slice in enumerate(new_slices):
            y = fig.add_subplot(4,5,num+1)
            y.imshow(each_slice, cmap='gray')
        plt.show()

    if label == 1: label=np.array([0,1])
    elif label == 0: label=np.array([1,0])
        
    return np.array(new_slices),label

#                                               stage 1 for real.
data_dir = 'X:/Kaggle_Data/datasciencebowl2017/sample_images/'
patients = os.listdir(data_dir)
labels = pd.read_csv('stage1_labels.csv', index_col=0)

much_data = []
for num,patient in enumerate(patients):
    if num % 100 == 0:
        print(num)
    try:
        img_data,label = process_data(patient,labels,img_px_size=IMG_SIZE_PX, hm_slices=SLICE_COUNT)
        #print(img_data.shape,label)
        much_data.append([img_data,label])
    except KeyError as e:
        print('This is unlabeled data!')

np.save('muchdata-{}-{}-{}.npy'.format(IMG_SIZE_PX,IMG_SIZE_PX,SLICE_COUNT), much_data)
0
This is unlabeled data!

3D Convolutional Neural Network

Moment-o-truth

Okay, we've got preprocessed, downsampled data. Now we're ready to feed it through our 3D convnet and...see what happens!

Now, I am not about to stuff a neural networks tutorial into this one. If you're already familiar with neural networks and TensorFlow, great! If not, as you might guess, I have a tutorial...or tutorials... for you!

To install the CPU version of TensorFlow, just do pip install tensorflow

To install the GPU version of TensorFlow, you need to get alllll the dependencies and such.

Installation tutorials:

Installing the GPU version of TensorFlow in Ubuntu

Installing the GPU version of TensorFlow on a Windows machine

Using TensorFlow and concept tutorials:

Introduction to deep learning with neural networks

Introduction to TensorFlow

Intro to Convolutional Neural Networks

Convolutional Neural Network in TensorFlow tutorial

Now, the data we have is actually 3D data, not the 2D data that's covered in most convnet tutorials, including mine above. So what changes? EVERYTHING! OMG IT'S THE END OF THE WORLD AS WE KNOW IT!!

It's not really all that bad. Your convolutional window/padding/strides need to change. Do note that, now, your processing penalty increases much more significantly as you increase the window size than it does with 2D windows.
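
To see why the cost climbs so quickly, just compare weight counts for a single filter on one input channel (plain arithmetic, nothing framework-specific):

window_2d = 5 * 5        # 25 weights for a 5x5 2D filter
window_3d = 5 * 5 * 5    # 125 weights for the "same" window in 3D, 5x the parameters
print(window_2d, window_3d)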

Okay, let's begin.

import tensorflow as tf
import numpy as np

IMG_SIZE_PX = 50
SLICE_COUNT = 20

n_classes = 2
batch_size = 10

x = tf.placeholder('float')
y = tf.placeholder('float')

keep_rate = 0.8

Not too bad to start: just some typical constants and imports, and we're ready to rumble. Let's begin with conv3d and max pooling.

def conv3d(x, W):
    return tf.nn.conv3d(x, W, strides=[1,1,1,1,1], padding='SAME')

def maxpool3d(x):
    #                        size of window         movement of window as you slide about
    return tf.nn.max_pool3d(x, ksize=[1,2,2,2,1], strides=[1,2,2,2,1], padding='SAME')

Now we're ready for the network itself:

def convolutional_neural_network(x):
    #                # 3 x 3 x 3 patches, 1 channel, 32 features to compute.
    weights = {'W_conv1':tf.Variable(tf.random_normal([3,3,3,1,32])),
               #       3 x 3 x 3 patches, 32 channels, 64 features to compute.
               'W_conv2':tf.Variable(tf.random_normal([3,3,3,32,64])),
               #       fully connected layer: 54080 flattened features in, 1024 out
               'W_fc':tf.Variable(tf.random_normal([54080,1024])),
               'out':tf.Variable(tf.random_normal([1024, n_classes]))}

    biases = {'b_conv1':tf.Variable(tf.random_normal([32])),
               'b_conv2':tf.Variable(tf.random_normal([64])),
               'b_fc':tf.Variable(tf.random_normal([1024])),
               'out':tf.Variable(tf.random_normal([n_classes]))}

    #                            image X      image Y        image Z
    x = tf.reshape(x, shape=[-1, IMG_SIZE_PX, IMG_SIZE_PX, SLICE_COUNT, 1])

    conv1 = tf.nn.relu(conv3d(x, weights['W_conv1']) + biases['b_conv1'])
    conv1 = maxpool3d(conv1)


    conv2 = tf.nn.relu(conv3d(conv1, weights['W_conv2']) + biases['b_conv2'])
    conv2 = maxpool3d(conv2)

    fc = tf.reshape(conv2,[-1, 54080])
    fc = tf.nn.relu(tf.matmul(fc, weights['W_fc'])+biases['b_fc'])
    fc = tf.nn.dropout(fc, keep_rate)

    output = tf.matmul(fc, weights['out'])+biases['out']

    return output

Why the magic number 54080? To get it, I simply ran the script once and read the expected size out of the error it threw at me. This is certainly not the right way to go about it, but that's my 100% honest method, and this is my first time working with a 3D convnet. As far as I know, it's the pooling and padding that keep this from being exactly 50,000 (50 x 50 x 20 is the size of our actual input data, 50,000 values total). This comes up whenever the fixed window reaches the edge of your data: either the window can just never cross the edge ("VALID" padding), or we can allow it to cross and simply pad the missing area with the "same" data as what was there before ("SAME" padding). If you'd rather work the number out ahead of time, a quick shape calculation is sketched below.
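
For what it's worth, here's how the number can be calculated beforehand (a sketch based on the layer sizes above; a SAME-padded max_pool3d with stride 2 produces ceil(dim / 2) along each axis, and the conv3d layers with stride 1 and SAME padding don't change the shape):

import math

dims = [50, 50, 20]                      # x, y, z after the reshape in the network above
for _ in range(2):                       # two maxpool3d layers, stride 2, SAME padding
    dims = [math.ceil(d / 2) for d in dims]
print(dims)                              # [13, 13, 5]
print(dims[0] * dims[1] * dims[2] * 64)  # 13 * 13 * 5 * 64 = 54080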

Now we're set to train the network:

much_data = np.load('muchdata-50-50-20.npy')
# If you are working with the basic sample data, use maybe 2 instead of 100 here... you don't have enough data to really do this
train_data = much_data[:-100]
validation_data = much_data[-100:]


def train_neural_network(x):
    prediction = convolutional_neural_network(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(prediction,y) )
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-5).minimize(cost)
    
    hm_epochs = 3
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        
        successful_runs = 0
        total_runs = 0
        
        for epoch in range(hm_epochs):
            for data in train_data:
                total_runs += 1
                try:
                    epoch_loss = 0
                    X = data[0]
                    Y = data[1]
                    #print(Y)
                    _, c = sess.run([optimizer, cost], feed_dict={x: X, y: Y})
                    epoch_loss += c
                    successful_runs += 1
                except Exception as e:
                    pass
                    #print(str(e))
            
            print('Epoch', epoch+1, 'completed out of',hm_epochs,'loss:',epoch_loss)

            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

            print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
            
        print('Done. Finishing accuracy:')
        print('Accuracy:',accuracy.eval({x:[i[0] for i in validation_data], y:[i[1] for i in validation_data]}))
        
        print('fitment percent:',successful_runs/total_runs)


train_neural_network(x)
WARNING:tensorflow:From :14 in train_neural_network.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
Epoch 1 completed out of 3 loss: 0.0
Accuracy: 0.59
Epoch 2 completed out of 3 loss: 0.0
Accuracy: 0.64
Epoch 3 completed out of 3 loss: 0.0
Accuracy: 0.56
Done. Finishing accuracy:
Accuracy: 0.59
fitment percent: 0.0


 
