Author Perrin, Jean Georges.

Title Spark in Action [electronic resource] : Covers Apache Spark 3 with Examples in Java, Python, and Scala.

Imprint New York : Manning Publications Co. LLC, 2020.
Description 1 online resource (498 p.)
text file
Series ITpro collection
Note Description based upon print version of record.
Bibliography Includes bibliographical references.
Summary Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you'll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you'll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.
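As a taste of the Java-first style this summary describes, here is a minimal sketch of a first Spark application: it starts a local session, ingests a CSV file into a dataframe, and shows a few rows. Only the standard Apache Spark 3 Java API is used; the file name data/books.csv is illustrative, not taken from the book's GitHub repository.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {
  public static void main(String[] args) {
    // Start a Spark session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to dataframe")
        .master("local")
        .getOrCreate();

    // Ingest a CSV file with a header row into a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv"); // hypothetical input file

    // Display up to five rows
    df.show(5);
    spark.stop();
  }
}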
Contents Intro -- Copyright -- brief contents -- contents -- front matter -- foreword -- The analytics operating system -- preface -- acknowledgments -- about this book -- Who should read this book -- What will you learn in this book? -- How this book is organized -- About the code -- liveBook discussion forum -- about the author -- about the cover illustration -- Part 1. The theory crippled by awesome examples -- 1. So, what is Spark, anyway? -- 1.1 The big picture: What Spark is and what it does -- 1.1.1 What is Spark? -- 1.1.2 The four pillars of mana -- 1.2 How can you use Spark? -- 1.2.1 Spark in a data processing/engineering scenario -- 1.2.2 Spark in a data science scenario -- 1.3 What can you do with Spark? -- 1.3.1 Spark predicts restaurant quality at NC eateries -- 1.3.2 Spark allows fast data transfer for Lumeris -- 1.3.3 Spark analyzes equipment logs for CERN -- 1.3.4 Other use cases -- 1.4 Why you will love the dataframe -- 1.4.1 The dataframe from a Java perspective -- 1.4.2 The dataframe from an RDBMS perspective -- 1.4.3 A graphical representation of the dataframe -- 1.5 Your first example -- 1.5.1 Recommended software -- 1.5.2 Downloading the code -- 1.5.3 Running your first application -- Command line -- Eclipse -- 1.5.4 Your first code -- Summary -- 2. Architecture and flow -- 2.1 Building your mental model -- 2.2 Using Java code to build your mental model -- 2.3 Walking through your application -- 2.3.1 Connecting to a master -- 2.3.2 Loading, or ingesting, the CSV file -- 2.3.3 Transforming your data -- 2.3.4 Saving the work done in your dataframe to a database -- Summary -- 3. The majestic role of the dataframe -- 3.1 The essential role of the dataframe in Spark -- 3.1.1 Organization of a dataframe -- 3.1.2 Immutability is not a swear word -- 3.2 Using dataframes through examples -- 3.2.1 A dataframe after a simple CSV ingestion.
6.2.2 Setting up the environment -- 6.3 Building your application to run on the cluster -- 6.3.1 Building your application's uber JAR -- 6.3.2 Building your application by using Git and Maven -- 6.4 Running your application on the cluster -- 6.4.1 Submitting the uber JAR -- 6.4.2 Running the application -- 6.4.3 The Spark user interface -- Summary -- Part 2. Ingestion -- 7. Ingestion from files -- 7.1 Common behaviors of parsers -- 7.2 Complex ingestion from CSV -- 7.2.1 Desired output -- 7.2.2 Code -- 7.3 Ingesting a CSV with a known schema -- 7.3.1 Desired output -- 7.3.2 Code -- 7.4 Ingesting a JSON file -- 7.4.1 Desired output -- 7.4.2 Code -- 7.5 Ingesting a multiline JSON file -- 7.5.1 Desired output -- 7.5.2 Code -- 7.6 Ingesting an XML file -- 7.6.1 Desired output -- 7.6.2 Code -- 7.7 Ingesting a text file -- 7.7.1 Desired output -- 7.7.2 Code -- 7.8 File formats for big data -- 7.8.1 The problem with traditional file formats -- 7.8.2 Avro is a schema-based serialization format -- 7.8.3 ORC is a columnar storage format -- 7.8.4 Parquet is also a columnar storage format -- 7.8.5 Comparing Avro, ORC, and Parquet -- 7.9 Ingesting Avro, ORC, and Parquet files -- 7.9.1 Ingesting Avro -- 7.9.2 Ingesting ORC -- 7.9.3 Ingesting Parquet -- 7.9.4 Reference table for ingesting Avro, ORC, or Parquet -- Summary -- 8. Ingestion from databases -- 8.1 Ingestion from relational databases -- 8.1.1 Database connection checklist -- 8.1.2 Understanding the data used in the examples -- 8.1.3 Desired output -- 8.1.4 Code -- 8.1.5 Alternative code -- 8.2 The role of the dialect -- 8.2.1 What is a dialect, anyway? -- 8.2.2 JDBC dialects provided with Spark -- 8.2.3 Building your own dialect -- 8.3 Advanced queries and ingestion -- 8.3.1 Filtering by using a WHERE clause -- 8.3.2 Joining data in the database -- 8.3.3 Performing ingestion and partitioning.
8.3.4 Summary of advanced features -- 8.4 Ingestion from Elasticsearch -- 8.4.1 Data flow -- 8.4.2 The New York restaurants dataset digested by Spark -- 8.4.3 Code to ingest the restaurant dataset from Elasticsearch -- Summary -- 9. Advanced ingestion: finding data sources and building your own -- 9.1 What is a data source? -- 9.2 Benefits of a direct connection to a data source -- 9.2.1 Temporary files -- 9.2.2 Data quality scripts -- 9.2.3 Data on demand -- 9.3 Finding data sources at Spark Packages -- 9.4 Building your own data source -- 9.4.1 Scope of the example project -- 9.4.2 Your data source API and options -- 9.5 Behind the scenes: Building the data source itself -- 9.6 Using the register file and the advertiser class -- 9.7 Understanding the relationship between the data and schema -- 9.7.1 The data source builds the relation -- 9.7.2 Inside the relation -- 9.8 Building the schema from a JavaBean -- 9.9 Building the dataframe is magic with the utilities -- 9.10 The other classes -- Summary -- 10. Ingestion through structured streaming -- 10.1 What's streaming? -- 10.2 Creating your first stream -- 10.2.1 Generating a file stream -- 10.2.2 Consuming the records -- 10.2.3 Getting records, not lines -- 10.3 Ingesting data from network streams -- 10.4 Dealing with multiple streams -- 10.5 Differentiating discretized and structured streaming -- Summary -- Part 3. Transforming your data -- 11. Working with SQL -- 11.1 Working with Spark SQL -- 11.2 The difference between local and global views -- 11.3 Mixing the dataframe API and Spark SQL -- 11.4 Don't DELETE it! -- 11.5 Going further with SQL -- Summary -- 12. Transforming your data -- 12.1 What is data transformation? -- 12.2 Process and example of record-level transformation -- 12.2.1 Data discovery to understand the complexity -- 12.2.2 Data mapping to draw the process.
12.2.3 Writing the transformation code -- 12.2.4 Reviewing your data transformation to ensure a quality process -- What about sorting? -- Wrapping up your first Spark transformation -- 12.3 Joining datasets -- 12.3.1 A closer look at the datasets to join -- 12.3.2 Building the list of higher education institutions per county -- Initialization of Spark -- Loading and preparing the data -- 12.3.3 Performing the joins -- Joining the FIPS county identifier with the higher ed dataset using a join -- Joining the census data to get the county name -- 12.4 Performing more transformations -- Summary -- 13. Transforming entire documents -- 13.1 Transforming entire documents and their structure -- 13.1.1 Flattening your JSON document -- 13.1.2 Building nested documents for transfer and storage -- 13.2 The magic behind static functions -- 13.3 Performing more transformations -- Summary -- 14. Extending transformations with user-defined functions -- 14.1 Extending Apache Spark -- 14.2 Registering and calling a UDF -- 14.2.1 Registering the UDF with Spark -- 14.2.2 Using the UDF with the dataframe API -- 14.2.3 Manipulating UDFs with SQL -- 14.2.4 Implementing the UDF -- 14.2.5 Writing the service itself -- 14.3 Using UDFs to ensure a high level of data quality -- 14.4 Considering UDFs' constraints -- Summary -- 15. Aggregating your data -- 15.1 Aggregating data with Spark -- 15.1.1 A quick reminder on aggregations -- 15.1.2 Performing basic aggregations with Spark -- Performing an aggregation using the dataframe API -- Performing an aggregation using Spark SQL -- 15.2 Performing aggregations with live data -- 15.2.1 Preparing your dataset -- 15.2.2 Aggregating data to better understand the schools -- What is the average enrollment for each school? -- What is the evolution of the number of students? -- What is the highest enrollment per school and year?
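Chapters 7 through 10 in the contents above center on ingestion. As an illustrative sketch only (not a listing from the book), here is what ingesting a CSV file with a known schema and persisting it in a columnar format looks like in the standard Apache Spark 3 Java API; the file paths and the columns id, name, and grade are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class IngestWithSchemaApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Ingestion with a known schema")
        .master("local[*]")
        .getOrCreate();

    // Declare the schema up front instead of letting Spark infer it
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("name", DataTypes.StringType, false),
        DataTypes.createStructField("grade", DataTypes.DoubleType, true)
    });

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .schema(schema)
        .load("data/restaurants.csv"); // hypothetical input file

    // Persist in Parquet, one of the columnar formats compared in section 7.8
    df.write().mode("overwrite").parquet("output/restaurants.parquet");

    spark.stop();
  }
}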
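Chapters 14 and 15 cover user-defined functions and aggregations. The following minimal sketch, again using only the public Spark 3 Java API, registers a trivial UDF, calls it through the dataframe API, and answers an average-enrollment question in the spirit of section 15.2.2; the toUpper UDF and the schools.csv columns (school, enrollment) are invented for illustration.

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfAndAggregationApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UDF and aggregation")
        .master("local[*]")
        .getOrCreate();

    // Register a trivial UDF that normalizes school names to uppercase
    spark.udf().register("toUpper",
        (UDF1<String, String>) s -> s == null ? null : s.toUpperCase(),
        DataTypes.StringType);

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("data/schools.csv"); // hypothetical dataset

    // Call the UDF through the dataframe API, then aggregate:
    // average enrollment per school
    df.withColumn("school", callUDF("toUpper", col("school")))
      .groupBy(col("school"))
      .agg(avg(col("enrollment")).as("avg_enrollment"))
      .show(5);

    spark.stop();
  }
}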
Subject Big data.
Data mining -- Computer programs.
Données volumineuses.
Exploration de données (Informatique) -- Logiciels.
Other Form: Print version: Perrin, Jean Georges. Spark in Action. New York : Manning Publications Co. LLC, c2020. 9781617295522
ISBN 9781638351306
1638351309
Standard No. 9781617295522