Please help view the problem of spark dynamic partition


Please help view the problem of spark dynamic partition

zhangliyun
Hi all:
  When I use the Spark dynamic partition feature, I meet a problem with HDFS quotas. I found that it is very easy to hit the quota problem (exceeding the maximum quota of a directory).

I have generated an unpartitioned table 'bsl12.email_edge_lyh_mth1' which contains 584M records, and I will insert it into a partitioned table "bsl12.email_edge_lyh_partitioned2":
--select count(*) from bsl12.email_edge_lyh_mth1; --584652128
--INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2 PARTITION (link_crtd_date) SELECT * FROM bsl12.email_edge_lyh_mth1;


When I viewed the temporary directory while the SQL was running, I saw multiple files with link_crtd_date=2018-01-01***; I guess one temporary file per record. As there are 584M records in the unpartitioned table, is there any parameter for us to control the temporary file count, to avoid the quota problem? (One idea I have is sketched after the listing below.)

```
133    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-0112%3A35%3A29
137    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A35%3A47
136    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A38%3A23
132    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A38%3A54
536    hdfs://horton/apps/risk/ars/datamart/email_edge_lyh_partitioned2/.hive-staging_hive_2019-08-22_19-41-38_747_7237025592628396381-1/-ext-10000/_temporary/0/_temporary/attempt_20190822195048_0000_m_001404_0/link_crtd_date=2018-01-01 12%3A40%3A01
```
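
Would clustering by the partition key help here? A sketch of what I mean (an assumption on my side, not yet verified on our cluster): if the rows are shuffled so that all rows with the same partition value land in one task, each partition should get roughly one temporary file instead of one per task-and-partition combination; the number of tasks would then be controlled by spark.sql.shuffle.partitions.

```
-- sketch: shuffle rows so each partition value lands in a single task,
-- producing roughly one temporary file per partition value
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2 PARTITION (link_crtd_date)
SELECT * FROM bsl12.email_edge_lyh_mth1
DISTRIBUTE BY link_crtd_date;
```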

Best Regards

Kelly Zhang


 


Re: Please help view the problem of spark dynamic partition

Roland Johann
It seems that column `link_crtd_date` is of type `timestamp`, so you are partitioning by date including time, which produces a huge number of directories. I assume your intent is to partition by date (partition_date=yyyy-MM-dd or year=yyyy/month=MM/day=dd), so you need to format/split your timestamp accordingly, for example (note that in a dynamic partition insert the partition columns must come last in the SELECT list):

-- partitioned by 'yyyy-MM-dd'
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (partition_date)
SELECT
  *,
  date_format(link_crtd_date, 'yyyy-MM-dd') AS partition_date
FROM bsl12.email_edge_lyh_mth1;

-- partitioned by year/month/day
INSERT OVERWRITE TABLE bsl12.email_edge_lyh_partitioned2
PARTITION (year, month, day)
SELECT
  *,
  year(link_crtd_date) AS year,
  month(link_crtd_date) AS month,
  day(link_crtd_date) AS day
FROM bsl12.email_edge_lyh_mth1;
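
To confirm the diagnosis, you could first count the distinct partition values the original statement would create; if the count is far larger than the number of days covered, times are leaking into the partition key:

```
-- number of dynamic partition directories the original INSERT would create
SELECT count(DISTINCT link_crtd_date) FROM bsl12.email_edge_lyh_mth1;
```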
Best Regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: [hidden email]
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann





Re: Re: Please help view the problem of spark dynamic partition

zhangliyun




Hi:
   `link_crtd_date` is a string in the format "yyyy-MM-dd", not a timestamp.

select link_crtd_date from bsl12.email_edge_lyh_mth1 limit 10;
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
2018-01-01
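
That said, the staging directory names in my first mail decode to values with a time component (%3A is ':'), e.g. '2018-01-01 12:35:47', so LIMIT 10 may simply be missing those rows. A broader check might be worth running, for example (sketch only):

```
-- find values longer than 'yyyy-MM-dd' (10 chars), i.e. carrying a time part
SELECT link_crtd_date, count(*) AS cnt
FROM bsl12.email_edge_lyh_mth1
WHERE length(link_crtd_date) > 10
GROUP BY link_crtd_date
LIMIT 20;
```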




Best Regards
Kelly Zhang


