myprojects-hg-reborn
Original repository (not a fork)

File Info

Revision	2c007f732b7bab7aa98c765d88647b0014c2bdcf
Size	940 bytes
Time	2015-03-26 01:36:49
Author	Lorenzo Isella
Log message	A simple script to convert the test and train datasets (without the target values!) to a numerical matrix based on the term frequency–inverse document frequency (TF-IDF).

Content

#! /usr/bin/env python



import pandas as pd
import numpy as np
# only feature_extraction is used below
from sklearn import feature_extraction


# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
#sample = pd.read_csv('sampleSubmission.csv')

# drop the id and target columns; the labels are kept in memory only
# and are not written to the output files
labels = train.target.values
#labels2=np.copy(labels)
train = train.drop('id', axis=1)
train = train.drop('target', axis=1)
test = test.drop('id', axis=1)

# transform counts to TF-IDF features: fit the IDF weights on the training
# set only, then reuse the same weights for the test set
tfidf = feature_extraction.text.TfidfTransformer()
train = tfidf.fit_transform(train).toarray()
test = tfidf.transform(test).toarray()

#labels=labels.reshape(-1,1)

# train=np.hstack((train,labels))

# train=pd.DataFrame(train)
# test=pd.DataFrame(test)


np.savetxt("train-tfidf.csv", train, delimiter=",")
np.savetxt("test-tfidf.csv", test, delimiter=",")

# train.to_csv("train-tfidf.csv", train)
# test.to_csv("test-tfidf.csv", test)



print("So far so good")
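
The transformation step of the script can be tried on a small count matrix without the CSV files. What follows is only a minimal sketch, not the author's code: the helper name `counts_to_tfidf` and the toy arrays are made up for illustration, and it assumes scikit-learn's default `TfidfTransformer` settings (L2 row normalization, smoothed IDF), which are the settings the script uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer


def counts_to_tfidf(train_counts, test_counts):
    """Fit TF-IDF weights on the training counts and apply them to both sets.

    Hypothetical helper: it mirrors the fit_transform / transform pair
    used in the script above.
    """
    tfidf = TfidfTransformer()  # defaults: norm='l2', use_idf=True, smooth_idf=True
    train_tfidf = tfidf.fit_transform(train_counts).toarray()
    test_tfidf = tfidf.transform(test_counts).toarray()
    return train_tfidf, test_tfidf


# Toy data (made up): rows are samples, columns are count features.
toy_train = np.array([[3, 0, 1],
                      [2, 0, 0],
                      [3, 0, 0],
                      [4, 0, 0]])
toy_test = np.array([[1, 0, 2]])

train_tfidf, test_tfidf = counts_to_tfidf(toy_train, toy_test)
print(train_tfidf.shape, test_tfidf.shape)
```

With the default L2 normalization every non-empty row of the result has unit length, and a feature that is zero in every sample stays zero. Fitting on the training set alone keeps the test set from influencing the IDF weights.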
