
Python for Data Science For Dummies
By: John Paul Mueller, Luca Massaron
eText | 29 January 2019 | Edition Number 2
The fast and easy way to learn Python programming and statistics
Python is a general-purpose programming language, created in the late 1980s and named after Monty Python, that's used by thousands of people for everything from testing microchips at Intel, to powering Instagram, to building video games with the PyGame library.
Python For Data Science For Dummies is written for people who are new to data analysis, and discusses the basics of Python data analysis programming and statistics. The book also discusses Google Colab, which makes it possible to write Python code in the cloud.
- Get started with data science and Python
- Visualize information
- Wrangle data
- Learn from data
The book provides the statistical background needed to get started in data science programming, including probability, random distributions, hypothesis testing, confidence intervals, and building regression models for prediction.
Introduction
About This Book
Foolish Assumptions
Icons Used in This Book
Beyond the Book
Where to Go from Here
Part 1: Getting Started With Data Science and Python
Chapter 1: Discovering the Match between Data Science and Python
Defining the Sexiest Job of the 21st Century
Considering the emergence of data science
Outlining the core competencies of a data scientist
Linking data science, big data, and AI
Understanding the role of programming
Creating the Data Science Pipeline
Preparing the data
Performing exploratory data analysis
Learning from data
Visualizing
Obtaining insights and data products
Understanding Python's Role in Data Science
Considering the shifting profile of data scientists
Working with a multipurpose, simple, and efficient language
Learning to Use Python Fast
Loading data
Training a model
Viewing a result
Chapter 2: Introducing Python's Capabilities and Wonders
Why Python?
Grasping Python's Core Philosophy
Contributing to data science
Discovering present and future development goals
Working with Python
Getting a taste of the language
Understanding the need for indentation
Working at the command line or in the IDE
Performing Rapid Prototyping and Experimentation
Considering Speed of Execution
Visualizing Power
Using the Python Ecosystem for Data Science
Accessing scientific tools using SciPy
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Implementing machine learning using Scikit-learn
Going for deep learning with Keras and TensorFlow
Plotting the data using matplotlib
Creating graphs with NetworkX
Parsing HTML documents using Beautiful Soup
Chapter 3: Setting Up Python for Data Science
Considering the Off-the-Shelf Cross-Platform Scientific Distributions
Getting Continuum Analytics Anaconda
Getting Enthought Canopy Express
Getting WinPython
Installing Anaconda on Windows
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Downloading the Datasets and Example Code
Using Jupyter Notebook
Defining the code repository
Understanding the datasets used in this book
Chapter 4: Working with Google Colab
Defining Google Colab
Understanding what Google Colab does
Considering the online coding difference
Using local runtime support
Getting a Google Account
Creating the account
Signing in
Working with Notebooks
Creating a new notebook
Opening existing notebooks
Saving notebooks
Downloading notebooks
Performing Common Tasks
Creating code cells
Creating text cells
Creating special cells
Editing cells
Moving cells
Using Hardware Acceleration
Executing the Code
Viewing Your Notebook
Displaying the table of contents
Getting notebook information
Checking code execution
Sharing Your Notebook
Getting Help
Part 2: Getting Your Hands Dirty With Data
Chapter 5: Understanding the Tools
Using the Jupyter Console
Interacting with screen text
Changing the window appearance
Getting Python help
Getting IPython help
Using magic functions
Discovering objects
Using Jupyter Notebook
Working with styles
Restarting the kernel
Restoring a checkpoint
Performing Multimedia and Graphic Integration
Embedding plots and other images
Loading examples from online sites
Obtaining online graphics and multimedia
Chapter 6: Working with Real Data
Uploading, Streaming, and Sampling Data
Uploading small amounts of data into memory
Streaming large amounts of data into memory
Generating variations on image data
Sampling data in different ways
Accessing Data in Structured Flat-File Form
Reading from a text file
Reading CSV delimited format
Reading Excel and other Microsoft Office files
Sending Data in Unstructured File Form
Managing Data from Relational Databases
Interacting with Data from NoSQL Databases
Accessing Data from the Web
Chapter 7: Conditioning Your Data
Juggling between NumPy and pandas
Knowing when to use NumPy
Knowing when to use pandas
Validating Your Data
Figuring out what's in your data
Removing duplicates
Creating a data map and data plan
Manipulating Categorical Variables
Creating categorical variables
Renaming levels
Combining levels
Dealing with Dates in Your Data
Formatting date and time values
Using the right time transformation
Dealing with Missing Data
Finding the missing data
Encoding missingness
Imputing missing data
Slicing and Dicing: Filtering and Selecting Data
Slicing rows
Slicing columns
Dicing
Concatenating and Transforming
Adding new cases and variables
Removing data
Sorting and shuffling
Aggregating Data at Any Level
Chapter 8: Shaping Data
Working with HTML Pages
Parsing XML and HTML
Using XPath for data extraction
Working with Raw Text
Dealing with Unicode
Stemming and removing stop words
Introducing regular expressions
Using the Bag of Words Model and Beyond
Understanding the bag of words model
Working with n-grams
Implementing TF-IDF transformations
Working with Graph Data
Understanding the adjacency matrix
Using NetworkX basics
Chapter 9: Putting What You Know in Action
Contextualizing Problems and Data
Evaluating a data science problem
Researching solutions
Formulating a hypothesis
Preparing your data
Considering the Art of Feature Creation
Defining feature creation
Combining variables
Understanding binning and discretization
Using indicator variables
Transforming distributions
Performing Operations on Arrays
Using vectorization
Performing simple arithmetic on vectors and matrices
Performing matrix vector multiplication
Performing matrix multiplication
Part 3: Visualizing Information
Chapter 10: Getting a Crash Course in MatPlotLib
Starting with a Graph
Defining the plot
Drawing multiple lines and plots
Saving your work to disk
Setting the Axis, Ticks, Grids
Getting the axes
Formatting the axes
Adding grids
Defining the Line Appearance
Working with line styles
Using colors
Adding markers
Using Labels, Annotations, and Legends
Adding labels
Annotating the chart
Creating a legend
Chapter 11: Visualizing the Data
Choosing the Right Graph
Showing parts of a whole with pie charts
Creating comparisons with bar charts
Showing distributions using histograms
Depicting groups using boxplots
Seeing data patterns using scatterplots
Creating Advanced Scatterplots
Depicting groups
Showing correlations
Plotting Time Series
Representing time on axes
Plotting trends over time
Plotting Geographical Data
Using an environment in Notebook
Getting the Basemap toolkit
Dealing with deprecated library issues
Using Basemap to plot geographic data
Visualizing Graphs
Developing undirected graphs
Developing directed graphs
Part 4: Wrangling Data
Chapter 12: Stretching Python's Capabilities
Playing with Scikit-learn
Understanding classes in Scikit-learn
Defining applications for data science
Performing the Hashing Trick
Using hash functions
Demonstrating the hashing trick
Working with deterministic selection
Considering Timing and Performance
Benchmarking with timeit
Working with the memory profiler
Running in Parallel on Multiple Cores
Performing multicore parallelism
Demonstrating multiprocessing
Chapter 13: Exploring Data Analysis
The EDA Approach
Defining Descriptive Statistics for Numeric Data
Measuring central tendency
Measuring variance and range
Working with percentiles
Defining measures of normality
Counting for Categorical Data
Understanding frequencies
Creating contingency tables
Creating Applied Visualization for EDA
Inspecting boxplots
Performing t-tests after boxplots
Observing parallel coordinates
Graphing distributions
Plotting scatterplots
Understanding Correlation
Using covariance and correlation
Using nonparametric correlation
Considering the chi-square test for tables
Modifying Data Distributions
Using different statistical distributions
Creating a Z-score standardization
Transforming other notable distributions
Chapter 14: Reducing Dimensionality
Understanding SVD
Looking for dimensionality reduction
Using SVD to measure the invisible
Performing Factor Analysis and PCA
Considering the psychometric model
Looking for hidden factors
Using components, not factors
Achieving dimensionality reduction
Squeezing information with t-SNE
Understanding Some Applications
Recognizing faces with PCA
Extracting topics with NMF
Recommending movies
Chapter 15: Clustering
Clustering with K-means
Understanding centroid-based algorithms
Creating an example with image data
Looking for optimal solutions
Clustering big data
Performing Hierarchical Clustering
Using a hierarchical cluster solution
Using a two-phase clustering solution
Discovering New Groups with DBScan
Chapter 16: Detecting Outliers in Data
Considering Outlier Detection
Finding more things that can go wrong
Understanding anomalies and novel data
Examining a Simple Univariate Method
Leveraging on the Gaussian distribution
Making assumptions and checking out
Developing a Multivariate Approach
Using principal component analysis
Using cluster analysis for spotting outliers
Automating detection with Isolation Forests
Part 5: Learning From Data
Chapter 17: Exploring Four Simple and Effective Algorithms
Guessing the Number: Linear Regression
Defining the family of linear models
Using more variables
Understanding limitations and problems
Moving to Logistic Regression
Applying logistic regression
Considering when classes are more
Making Things as Simple as Naïve Bayes
Finding out that Naïve Bayes isn't so naïve
Predicting text classifications
Learning Lazily with Nearest Neighbors
Predicting after observing neighbors
Choosing your k parameter wisely
Chapter 18: Performing Cross-Validation, Selection, and Optimization
Pondering the Problem of Fitting a Model
Understanding bias and variance
Defining a strategy for picking models
Dividing between training and test sets
Cross-Validating
Using cross-validation on k folds
Sampling stratifications for complex data
Selecting Variables Like a Pro
Selecting by univariate measures
Using a greedy search
Pumping Up Your Hyperparameters
Implementing a grid search
Trying a randomized search
Chapter 19: Increasing Complexity with Linear and Nonlinear Tricks
Using Nonlinear Transformations
Doing variable transformations
Creating interactions between variables
Regularizing Linear Models
Relying on Ridge regression (L2)
Using the Lasso (L1)
Leveraging regularization
Combining L1 & L2: Elasticnet
Fighting with Big Data Chunk by Chunk
Determining when there is too much data
Implementing Stochastic Gradient Descent
Understanding Support Vector Machines
Relying on a computational method
Fixing many new parameters
Classifying with SVC
Going nonlinear is easy
Performing regression with SVR
Creating a stochastic solution with SVM
Playing with Neural Networks
Understanding neural networks
Classifying and regressing with neurons
Chapter 20: Understanding the Power of the Many
Starting with a Plain Decision Tree
Understanding a decision tree
Creating trees for different purposes
Making Machine Learning Accessible
Working with a Random Forest classifier
Working with a Random Forest regressor
Optimizing a Random Forest
Boosting Predictions
Knowing that many weak predictors win
Setting a gradient boosting classifier
Running a gradient boosting regressor
Using GBM hyperparameters
Part 6: The Part of Tens
Chapter 21: Ten Essential Data Resources
Discovering the News with Subreddit
Getting a Good Start with KDnuggets
Locating Free Learning Resources with Quora
Gaining Insights with Oracle's Data Science Blog
Accessing the Huge List of Resources on Data Science Central
Learning New Tricks from the Aspirational Data Scientist
Obtaining the Most Authoritative Sources at Udacity
Receiving Help with Advanced Topics at Conductrics
Obtaining the Facts of Open Source Data Science from Masters
Zeroing In on Developer Resources with Jonathan Bower
Chapter 22: Ten Data Challenges You Should Take
Meeting the Data Science London + Scikit-learn Challenge
Predicting Survival on the Titanic
Finding a Kaggle Competition that Suits Your Needs
Honing Your Overfit Strategies
Trudging Through the MovieLens Dataset
Getting Rid of Spam E-mails
Working with Handwritten Information
Working with Pictures
Analyzing Amazon.com Reviews
Interacting with a Huge Graph
Index
ISBN: 9781119547662
ISBN-10: 1119547660
Published: 29th January 2019
Format: ePUB
Language: English
Audience: Professional and Scholarly
Publisher: Wiley Professional Development (P&T)
Country of Publication: US
Edition Number: 2