Count unique ids between two consecutive dates that are values of a column in PySpark


I have a PySpark DataFrame with ID and Date columns that looks like this:

ID Date
1 2021-10-01
2 2021-10-01
1 2021-10-02
3 2021-10-02

I want to count the number of unique IDs that did not exist on the date one day before. So here the result would be 1, as there is only one new unique ID on 2021-10-02.

ID Date Count
1 2021-10-01 -
2 2021-10-01 -
1 2021-10-02 1
3 2021-10-02 1

I tried following this solution, but it does not work on date-type values. Any help would be highly appreciated. Thank you!


Answer 1

If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:

from pyspark.sql import Row, Window
from pyspark.sql import functions as F  # used as F.* in the steps below
import datetime

# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame([
    Row(ID=1, date=datetime.date(2021,10,1)),
    Row(ID=2, date=datetime.date(2021,10,1)),
    Row(ID=1, date=datetime.date(2021,10,2)),
    Row(ID=2, date=datetime.date(2021,10,2)),
    Row(ID=1, date=datetime.date(2021,10,3)),
    Row(ID=3, date=datetime.date(2021,10,3)),
])

First, add the number of days since an ID was last seen (this will be None if it never appeared before):

df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))

Second, we add a column marking the rows where this number of days is not 1. We put a 1 into this column so that we can later sum over it to count the rows:

df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))

Now we sum over all rows with the same date and then remove the columns we no longer need:

(
    df
    .withColumn('count', F.sum('is_new').over(Window.partitionBy('date')))  # sum over all rows with the same date
    .drop('is_new', 'days_since_last_occurrence')
    .sort('date', 'ID')
    .show()
)
# Output:
+---+----------+-----+
| ID|      date|count|
+---+----------+-----+
|  1|2021-10-01|    2|
|  2|2021-10-01|    2|
|  1|2021-10-02| null|
|  2|2021-10-02| null|
|  1|2021-10-03|    1|
|  3|2021-10-03|    1|
+---+----------+-----+
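
The window logic above can be cross-checked without Spark. The following is only a plain-Python sketch of the same idea, not part of the original answer: the function name `count_new_ids` and the tuple layout are made up for illustration. It reports 0 where Spark's `sum` over an all-null column would show null.

```python
import datetime
from collections import defaultdict

# Same sample rows as in the Spark example: (ID, date)
rows = [
    (1, datetime.date(2021, 10, 1)),
    (2, datetime.date(2021, 10, 1)),
    (1, datetime.date(2021, 10, 2)),
    (2, datetime.date(2021, 10, 2)),
    (1, datetime.date(2021, 10, 3)),
    (3, datetime.date(2021, 10, 3)),
]

def count_new_ids(rows):
    """Per date, count IDs whose previous occurrence was not exactly one day earlier."""
    # Equivalent of Window.partitionBy('ID').orderBy('date')
    dates_by_id = defaultdict(list)
    for id_, date in rows:
        dates_by_id[id_].append(date)

    counts = {date: 0 for _, date in rows}
    for dates in dates_by_id.values():
        previous = None  # plays the role of F.lag('date')
        for date in sorted(dates):
            # Mirrors is_new: 1 unless the ID was seen exactly one day before
            if previous is None or (date - previous).days != 1:
                counts[date] += 1
            previous = date
    return counts

new_id_counts = count_new_ids(rows)
for date in sorted(new_id_counts):
    print(date, new_id_counts[date])
```

Note that, as in the Spark version, every ID on the first date counts as new, because it has no earlier occurrence.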

Answer 2

Take the ID list of the current day and of the previous day, then get the size of the difference between the two to get the final result.

Updated to a solution that eliminates the join:

from pyspark.sql import functions as F

df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
    .select('*', F.expr('size(array_except(id_arr, lag(id_arr,1,id_arr) over (order by date))) as count')) \
    .select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)
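
The set-difference idea can also be sketched in plain Python, using the sample data from the question. This is only an illustration under assumptions, not the answer's own code: the function name `count_new_vs_previous_date` is hypothetical. It mirrors two details of the expression above: `lag(id_arr, 1, id_arr)` makes the first date compare against itself (giving 0), and `order by date` compares each date with the previous date present in the data, which is not necessarily the previous calendar day if dates are missing.

```python
import datetime

# Sample rows from the question: (ID, date)
rows = [
    (1, datetime.date(2021, 10, 1)),
    (2, datetime.date(2021, 10, 1)),
    (1, datetime.date(2021, 10, 2)),
    (3, datetime.date(2021, 10, 2)),
]

def count_new_vs_previous_date(rows):
    """Per date, count IDs that are absent from the previous date in the data."""
    # Equivalent of collect_set(id) over (partition by date)
    ids_by_date = {}
    for id_, date in rows:
        ids_by_date.setdefault(date, set()).add(id_)

    counts = {}
    previous_ids = None
    for date in sorted(ids_by_date):
        current_ids = ids_by_date[date]
        # Default of lag(id_arr, 1, id_arr): the first date is compared with itself
        baseline = current_ids if previous_ids is None else previous_ids
        # Equivalent of size(array_except(current, previous))
        counts[date] = len(current_ids - baseline)
        previous_ids = current_ids
    return counts

counts_by_date = count_new_vs_previous_date(rows)
for date in sorted(counts_by_date):
    print(date, counts_by_date[date])
```

For 2021-10-02 this gives 1 (only ID 3 is new), which matches the result expected in the question.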
