Java Spark Dataset Examples

Dataframes explained: The modern in-memory data science format

Most people are familiar with data in the form of a spreadsheet, with labeled columns of different data types such as name, address, age, and so on. Databases work the same way, with each table laid ...

Nature

A hybrid recommendation algorithm based on user nearest neighbor model

In the realm of e-commerce, personalized recommendations are a crucial component in enhancing user experience and optimizing sales efficiency. To address the inherent sparsity challenge prevalent in ...

GitHub

GoogleCloudDataproc/spark-bigquery-connector

The Storage API streams data in parallel directly from BigQuery via gRPC without using Google Cloud Storage as an intermediary. It has a number of advantages over using the previous export-based read ...

Linux Journal

Harnessing the Power of Big Data: Exploring Linux Data Science with Apache Spark and Jupyter

Big data refers to datasets that are too large, complex, or fast-changing to be handled by traditional data processing tools. It is characterized by the four V's. Big data analytics plays a crucial ...

Polars vs Spark: The Good, the Bad, and the Ugly

Polars is often compared to Spark. In this post, I will highlight the main differences and the best use cases for each in my data engineering activities. As a Data Engineer, I primarily focus on the ...

Ocelot: Scaling observational causal inference at LinkedIn

At LinkedIn, we constantly evaluate the value our products and services deliver, so that we can provide the best possible experiences for our members and customers. This includes understanding how ...

Nature

RS-FISH: precise, interactive, fast, and scalable FISH spot detection

Fluorescent in-situ hybridization (FISH)-based methods extract spatially resolved genetic and epigenetic information from biological samples by detecting fluorescent spots in microscopy images, an ...

Microsoft

SynapseML: A simple, multilingual, and massively parallel machine learning library

Today, we’re excited to announce the release of SynapseML (previously MMLSpark), an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines. Building ...

GitHub

Apache Spark - Apache HBase Connector

The Apache Spark - Apache HBase Connector is a library that supports Spark accessing HBase tables as an external data source or sink. With it, users can operate HBase with Spark-SQL on DataFrame and DataSet ...

InfoQ

Traffic Data Monitoring Using IoT, Kafka and Spark Streaming