Author Drabas, Tomasz, author.

Title Learning PySpark : build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0 / Tomasz Drabas, Denny Lee ; foreword by Holden Karau. [O'Reilly electronic resource]

Publication Info. Birmingham, UK : Packt Publishing, 2017.
Description 1 online resource (1 volume) : illustrations, maps
Note Includes index.
Contents Cover -- Copyright -- Credits -- Foreword -- About the Authors -- About the Reviewer -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface
Chapter 1: Understanding Spark -- What is Apache Spark? -- Spark Jobs and APIs -- Execution process -- Resilient Distributed Dataset -- DataFrames -- Datasets -- Catalyst Optimizer -- Project Tungsten -- Spark 2.0 architecture -- Unifying Datasets and DataFrames -- Introducing SparkSession -- Tungsten phase 2 -- Structured streaming -- Continuous applications -- Summary
Chapter 2: Resilient Distributed Datasets -- Internal workings of an RDD -- Creating RDDs -- Schema -- Reading from files -- Lambda expressions -- Global versus local scope -- Transformations -- The .map(...) transformation -- The .filter(...) transformation -- The .flatMap(...) transformation -- The .distinct(...) transformation -- The .sample(...) transformation -- The .leftOuterJoin(...) transformation -- The .repartition(...) transformation -- Actions -- The .take(...) method -- The .collect(...) method -- The .reduce(...) method -- The .count(...) method -- The .saveAsTextFile(...) method -- The .foreach(...) method -- Summary
Chapter 3: DataFrames -- Python to RDD communications -- Catalyst Optimizer refresh -- Speeding up PySpark with DataFrames -- Creating DataFrames -- Generating our own JSON data -- Creating a DataFrame -- Creating a temporary table -- Simple DataFrame queries -- DataFrame API query -- SQL query -- Interoperating with RDDs -- Inferring the schema using reflection -- Programmatically specifying the schema -- Querying with the DataFrame API -- Number of rows -- Running filter statements -- Querying with SQL -- Number of rows -- Running filter statements using the where clauses -- DataFrame scenario: on-time flight performance -- Preparing the source datasets -- Joining flight performance and airports -- Visualizing our flight-performance data -- Spark Dataset API -- Summary
Chapter 4: Prepare Data for Modeling -- Checking for duplicates, missing observations, and outliers -- Duplicates -- Missing observations -- Outliers -- Getting familiar with your data -- Descriptive statistics -- Correlations -- Visualization -- Histograms -- Interactions between features -- Summary
Chapter 5: Introducing MLlib -- Overview of the package -- Loading and transforming the data -- Getting to know your data -- Descriptive statistics -- Correlations -- Statistical testing -- Creating the final dataset -- Creating an RDD of LabeledPoints -- Splitting into training and testing -- Predicting infant survival -- Logistic regression in MLlib -- Selecting only the most predictable features -- Random forest in MLlib -- Summary
Chapter 6: Introducing the ML Package -- Overview of the package -- Transformer -- Estimators -- Classification -- Regression -- Clustering -- Pipeline -- Predicting the chances of infant survival with ML -- Loading the data -- Creating transformers -- Creating an estimator -- Creating a pipeline -- Fitting the model -- Evaluating the performance of the model -- Saving the model -- Parameter hyper-tuning -- Grid search -- Train-validation splitting -- Other features of PySpark ML in action -- Feature extraction -- NLP-related feature extractors -- Discretizing continuous variables -- Standardizing continuous variables -- Classification -- Clustering -- Finding clusters in the births dataset -- Topic mining -- Regression -- Summary
Chapter 7: GraphFrames -- Introducing GraphFrames -- Installing GraphFrames -- Creating a library -- Preparing your flights dataset -- Building the graph -- Executing simple queries -- Determining the number of airports and trips -- Determining the longest delay in this dataset -- Determining the number of delayed versus on-time/early flights -- What flights departing Seattle are most likely to have significant delays? -- What states tend to have significant delays departing from Seattle? -- Understanding vertex degrees -- Determining the top transfer airports -- Understanding motifs -- Determining airport ranking using PageRank -- Determining the most popular non-stop flights -- Using Breadth-First Search -- Visualizing flights using D3 -- Summary
Chapter 8: TensorFrames -- What is Deep Learning? -- The need for neural networks and Deep Learning -- What is feature engineering? -- Bridging the data and algorithm -- What is TensorFlow? -- Installing Pip -- Installing TensorFlow -- Matrix multiplication using constants -- Matrix multiplication using placeholders -- Running the model -- Running another model -- Discussion -- Introducing TensorFrames -- TensorFrames: quick start -- Configuration and setup -- Launching a Spark cluster -- Creating a TensorFrames library -- Installing TensorFlow on your cluster -- Using TensorFlow to add a constant to an existing column -- Executing the Tensor graph -- Blockwise reducing operations example -- Building a DataFrame of vectors -- Analysing the DataFrame -- Computing elementwise sum and min of all vectors -- Summary
Chapter 9: Polyglot Persistence with Blaze -- Installing Blaze -- Polyglot persistence -- Abstracting data -- Working with NumPy arrays -- Working with pandas' DataFrame -- Working with files -- Working with databases -- Interacting with relational databases -- Interacting with the MongoDB database -- Data operations -- Accessing columns -- Symbolic transformations -- Operations on columns -- Reducing data -- Joins -- Summary
Chapter 10: Structured Streaming -- What is Spark Streaming? -- Why do we need Spark Streaming? -- What is the Spark Streaming application data flow? -- Simple streaming application using DStreams -- A quick primer on global aggregations -- Introducing Structured Streaming -- Summary
Chapter 11: Packaging Spark Applications -- The spark-submit command -- Command line parameters -- Deploying the app programmatically -- Configuring your SparkSession -- Creating SparkSession -- Modularizing code -- Structure of the module -- Calculating the distance between two points -- Converting distance units -- Building an egg -- User defined functions in Spark -- Submitting a job -- Monitoring execution -- Databricks Jobs -- Summary
Index.
Summary Annotation: Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0.
About This Book
- Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
- Develop and deploy efficient, scalable real-time Spark solutions
- Take your understanding of using Spark with Python to the next level with this jump-start guide
Who This Book Is For
If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.
What You Will Learn
- Learn about Apache Spark and the Spark 2.0 architecture
- Build and interact with Spark DataFrames using Spark SQL
- Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames, respectively
- Read, transform, and understand data and use it to train machine learning models
- Build machine learning models with MLlib and ML
- Learn how to submit your applications programmatically using spark-submit
- Deploy locally built applications to a cluster
In Detail
Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark and learn how to abstract data with RDDs and DataFrames, and you will understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.
Style and approach
This book takes a comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in an easy-to-understand manner, with a focus on both the hows and the whys of each concept.
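For context only: the sketch below is not excerpted from the book; it is a minimal, hypothetical example of the SparkSession, DataFrame, and Spark SQL usage the annotation describes. The flight-delay rows, column names, and application name are invented for illustration.

    # Minimal PySpark 2.x sketch; assumes PySpark is installed locally.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("learning-pyspark-sketch").getOrCreate()

    # Build a small DataFrame of made-up flight delays and expose it to SQL.
    flights = spark.createDataFrame(
        [("SEA", "SFO", 15), ("SEA", "JFK", 45), ("PDX", "SEA", 0)],
        ["origin", "destination", "delay"],
    )
    flights.createOrReplaceTempView("flights")

    # Query the same data through the DataFrame API and through Spark SQL.
    flights.filter(flights.delay > 0).show()
    spark.sql("SELECT origin, AVG(delay) AS avg_delay FROM flights GROUP BY origin").show()

    spark.stop()

A script along these lines can be run locally or packaged and handed to the spark-submit command that Chapter 11 covers.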
Subject Application software -- Development.
Python (Computer program language)
SPARK (Computer program language)
Added Author Lee, Denny, author.
Karau, Holden, writer of foreword.
Other Form: Print version: Drabas, Tomasz. Learning PySpark. Birmingham : Packt Publishing, ©2017
ISBN 9781786466259 (electronic bk.)
1786466252 (electronic bk.)