public class KeyValueGroupedDataset<K,V>
extends Object
implements scala.Serializable
Dataset
has been logically grouped by a user specified grouping key. Users should not
construct a KeyValueGroupedDataset
directly, but should instead call groupByKey
on
an existing Dataset
.
Modifier and Type | Method and Description |
---|---|
<U1> Dataset<scala.Tuple2<K,U1>> |
agg(TypedColumn<V,U1> col1)
Computes the given aggregation, returning a
Dataset of tuples for each unique key
and the result of computing this aggregation over all elements in the group. |
<U1,U2> Dataset<scala.Tuple3<K,U1,U2>> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3> Dataset<scala.Tuple4<K,U1,U2,U3>> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4,U5> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4,
TypedColumn<V,U5> col5)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4,U5,U6> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4,
TypedColumn<V,U5> col5,
TypedColumn<V,U6> col6)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4,U5,U6,U7> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4,
TypedColumn<V,U5> col5,
TypedColumn<V,U6> col6,
TypedColumn<V,U7> col7)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4,U5,U6,U7,U8> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4,
TypedColumn<V,U5> col5,
TypedColumn<V,U6> col6,
TypedColumn<V,U7> col7,
TypedColumn<V,U8> col8)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U,R> Dataset<R> |
cogroup(KeyValueGroupedDataset<K,U> other,
CoGroupFunction<K,V,U,R> f,
Encoder<R> encoder)
(Java-specific)
Applies the given function to each cogrouped data.
|
<U,R> Dataset<R> |
cogroup(KeyValueGroupedDataset<K,U> other,
scala.Function3<K,scala.collection.Iterator<V>,scala.collection.Iterator<U>,scala.collection.TraversableOnce<R>> f,
Encoder<R> evidence$11)
(Scala-specific)
Applies the given function to each cogrouped data.
|
Dataset<scala.Tuple2<K,Object>> |
count()
Returns a
Dataset that contains a tuple with each key and the number of items present
for that key. |
<U> Dataset<U> |
flatMapGroups(FlatMapGroupsFunction<K,V,U> f,
Encoder<U> encoder)
(Java-specific)
Applies the given function to each group of data.
|
<U> Dataset<U> |
flatMapGroups(scala.Function2<K,scala.collection.Iterator<V>,scala.collection.TraversableOnce<U>> f,
Encoder<U> evidence$3)
(Scala-specific)
Applies the given function to each group of data.
|
<S,U> Dataset<U> |
flatMapGroupsWithState(FlatMapGroupsWithStateFunction<K,V,S,U> func,
OutputMode outputMode,
Encoder<S> stateEncoder,
Encoder<U> outputEncoder,
GroupStateTimeout timeoutConf)
(Java-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<S,U> Dataset<U> |
flatMapGroupsWithState(OutputMode outputMode,
GroupStateTimeout timeoutConf,
scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,scala.collection.Iterator<U>> func,
Encoder<S> evidence$9,
Encoder<U> evidence$10)
(Scala-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<L> KeyValueGroupedDataset<L,V> |
keyAs(Encoder<L> evidence$1)
Returns a new
KeyValueGroupedDataset where the type of the key has been mapped to the
specified type. |
Dataset<K> |
keys()
Returns a
Dataset that contains each unique key. |
<U> Dataset<U> |
mapGroups(scala.Function2<K,scala.collection.Iterator<V>,U> f,
Encoder<U> evidence$4)
(Scala-specific)
Applies the given function to each group of data.
|
<U> Dataset<U> |
mapGroups(MapGroupsFunction<K,V,U> f,
Encoder<U> encoder)
(Java-specific)
Applies the given function to each group of data.
|
<S,U> Dataset<U> |
mapGroupsWithState(scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,U> func,
Encoder<S> evidence$5,
Encoder<U> evidence$6)
(Scala-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<S,U> Dataset<U> |
mapGroupsWithState(GroupStateTimeout timeoutConf,
scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,U> func,
Encoder<S> evidence$7,
Encoder<U> evidence$8)
(Scala-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<S,U> Dataset<U> |
mapGroupsWithState(MapGroupsWithStateFunction<K,V,S,U> func,
Encoder<S> stateEncoder,
Encoder<U> outputEncoder)
(Java-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<S,U> Dataset<U> |
mapGroupsWithState(MapGroupsWithStateFunction<K,V,S,U> func,
Encoder<S> stateEncoder,
Encoder<U> outputEncoder,
GroupStateTimeout timeoutConf)
(Java-specific)
Applies the given function to each group of data, while maintaining a user-defined per-group
state.
|
<W> KeyValueGroupedDataset<K,W> |
mapValues(scala.Function1<V,W> func,
Encoder<W> evidence$2)
Returns a new
KeyValueGroupedDataset where the given function func has been applied
to the data. |
<W> KeyValueGroupedDataset<K,W> |
mapValues(MapFunction<V,W> func,
Encoder<W> encoder)
Returns a new
KeyValueGroupedDataset where the given function func has been applied
to the data. |
org.apache.spark.sql.execution.QueryExecution |
queryExecution() |
Dataset<scala.Tuple2<K,V>> |
reduceGroups(scala.Function2<V,V,V> f)
(Scala-specific)
Reduces the elements of each group of data using the specified binary function.
|
Dataset<scala.Tuple2<K,V>> |
reduceGroups(ReduceFunction<V> f)
(Java-specific)
Reduces the elements of each group of data using the specified binary function.
|
String |
toString() |
public <U1> Dataset<scala.Tuple2<K,U1>> agg(TypedColumn<V,U1> col1)
Dataset
of tuples for each unique key
and the result of computing this aggregation over all elements in the group.
col1
- (undocumented)public <U1,U2> Dataset<scala.Tuple3<K,U1,U2>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)public <U1,U2,U3> Dataset<scala.Tuple4<K,U1,U2,U3>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)public <U1,U2,U3,U4> Dataset<scala.Tuple5<K,U1,U2,U3,U4>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)public <U1,U2,U3,U4,U5> Dataset<scala.Tuple6<K,U1,U2,U3,U4,U5>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4, TypedColumn<V,U5> col5)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)public <U1,U2,U3,U4,U5,U6> Dataset<scala.Tuple7<K,U1,U2,U3,U4,U5,U6>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4, TypedColumn<V,U5> col5, TypedColumn<V,U6> col6)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)public <U1,U2,U3,U4,U5,U6,U7> Dataset<scala.Tuple8<K,U1,U2,U3,U4,U5,U6,U7>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4, TypedColumn<V,U5> col5, TypedColumn<V,U6> col6, TypedColumn<V,U7> col7)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)col7
- (undocumented)public <U1,U2,U3,U4,U5,U6,U7,U8> Dataset<scala.Tuple9<K,U1,U2,U3,U4,U5,U6,U7,U8>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4, TypedColumn<V,U5> col5, TypedColumn<V,U6> col6, TypedColumn<V,U7> col7, TypedColumn<V,U8> col8)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)col5
- (undocumented)col6
- (undocumented)col7
- (undocumented)col8
- (undocumented)public <U,R> Dataset<R> cogroup(KeyValueGroupedDataset<K,U> other, scala.Function3<K,scala.collection.Iterator<V>,scala.collection.Iterator<U>,scala.collection.TraversableOnce<R>> f, Encoder<R> evidence$11)
Dataset
this
and other
. The function can return an iterator containing elements of an
arbitrary type which will be returned as a new Dataset
.
other
- (undocumented)f
- (undocumented)evidence$11
- (undocumented)public <U,R> Dataset<R> cogroup(KeyValueGroupedDataset<K,U> other, CoGroupFunction<K,V,U,R> f, Encoder<R> encoder)
Dataset
this
and other
. The function can return an iterator containing elements of an
arbitrary type which will be returned as a new Dataset
.
other
- (undocumented)f
- (undocumented)encoder
- (undocumented)public Dataset<scala.Tuple2<K,Object>> count()
Dataset
that contains a tuple with each key and the number of items present
for that key.
public <U> Dataset<U> flatMapGroups(scala.Function2<K,scala.collection.Iterator<V>,scala.collection.TraversableOnce<U>> f, Encoder<U> evidence$3)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
org.apache.spark.sql.expressions#Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)evidence$3
- (undocumented)public <U> Dataset<U> flatMapGroups(FlatMapGroupsFunction<K,V,U> f, Encoder<U> encoder)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
org.apache.spark.sql.expressions#Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)encoder
- (undocumented)public <S,U> Dataset<U> flatMapGroupsWithState(OutputMode outputMode, GroupStateTimeout timeoutConf, scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,scala.collection.Iterator<U>> func, Encoder<S> evidence$9, Encoder<U> evidence$10)
GroupState
for more details.
func
- Function to be called on every group.outputMode
- The output mode of the function.timeoutConf
- Timeout configuration for groups that do not receive data for a while.
See Encoder
for more details on what types are encodable to Spark SQL.
evidence$9
- (undocumented)evidence$10
- (undocumented)public <S,U> Dataset<U> flatMapGroupsWithState(FlatMapGroupsWithStateFunction<K,V,S,U> func, OutputMode outputMode, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf)
GroupState
for more details.
func
- Function to be called on every group.outputMode
- The output mode of the function.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.
See Encoder
for more details on what types are encodable to Spark SQL.
public <L> KeyValueGroupedDataset<L,V> keyAs(Encoder<L> evidence$1)
KeyValueGroupedDataset
where the type of the key has been mapped to the
specified type. The mapping of key columns to the type follows the same rules as as
on
Dataset
.
evidence$1
- (undocumented)public Dataset<K> keys()
Dataset
that contains each unique key. This is equivalent to doing mapping
over the Dataset to extract the keys and then running a distinct operation on those.
public <U> Dataset<U> mapGroups(scala.Function2<K,scala.collection.Iterator<V>,U> f, Encoder<U> evidence$4)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
org.apache.spark.sql.expressions#Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)evidence$4
- (undocumented)public <U> Dataset<U> mapGroups(MapGroupsFunction<K,V,U> f, Encoder<U> encoder)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
org.apache.spark.sql.expressions#Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)encoder
- (undocumented)public <S,U> Dataset<U> mapGroupsWithState(scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,U> func, Encoder<S> evidence$5, Encoder<U> evidence$6)
GroupState
for more details.
func
- Function to be called on every group.
See Encoder
for more details on what types are encodable to Spark SQL.
evidence$5
- (undocumented)evidence$6
- (undocumented)public <S,U> Dataset<U> mapGroupsWithState(GroupStateTimeout timeoutConf, scala.Function3<K,scala.collection.Iterator<V>,GroupState<S>,U> func, Encoder<S> evidence$7, Encoder<U> evidence$8)
GroupState
for more details.
func
- Function to be called on every group.timeoutConf
- Timeout configuration for groups that do not receive data for a while.
See Encoder
for more details on what types are encodable to Spark SQL.
evidence$7
- (undocumented)evidence$8
- (undocumented)public <S,U> Dataset<U> mapGroupsWithState(MapGroupsWithStateFunction<K,V,S,U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder)
GroupState
for more details.
func
- Function to be called on every group.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.
See Encoder
for more details on what types are encodable to Spark SQL.
public <S,U> Dataset<U> mapGroupsWithState(MapGroupsWithStateFunction<K,V,S,U> func, Encoder<S> stateEncoder, Encoder<U> outputEncoder, GroupStateTimeout timeoutConf)
GroupState
for more details.
func
- Function to be called on every group.stateEncoder
- Encoder for the state type.outputEncoder
- Encoder for the output type.timeoutConf
- Timeout configuration for groups that do not receive data for a while.
See Encoder
for more details on what types are encodable to Spark SQL.
public <W> KeyValueGroupedDataset<K,W> mapValues(scala.Function1<V,W> func, Encoder<W> evidence$2)
KeyValueGroupedDataset
where the given function func
has been applied
to the data. The grouping key is unchanged by this.
// Create values grouped by key from a Dataset[(K, V)]
ds.groupByKey(_._1).mapValues(_._2) // Scala
func
- (undocumented)evidence$2
- (undocumented)public <W> KeyValueGroupedDataset<K,W> mapValues(MapFunction<V,W> func, Encoder<W> encoder)
KeyValueGroupedDataset
where the given function func
has been applied
to the data. The grouping key is unchanged by this.
// Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>
Dataset<Tuple2<String, Integer>> ds = ...;
KeyValueGroupedDataset<String, Integer> grouped =
ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT());
func
- (undocumented)encoder
- (undocumented)public org.apache.spark.sql.execution.QueryExecution queryExecution()
public Dataset<scala.Tuple2<K,V>> reduceGroups(scala.Function2<V,V,V> f)
f
- (undocumented)public Dataset<scala.Tuple2<K,V>> reduceGroups(ReduceFunction<V> f)
f
- (undocumented)public String toString()
toString
in class Object