Re: Ask for ARM CI for spark

Re: Ask for ARM CI for spark

Takeshi Yamamuro
Hi, all,

FYI:
>> @Yuming Wang the results in float8.sql are from PostgreSQL directly?
>> Interesting if it also returns the same less accurate result, which
>> might suggest it's more to do with underlying OS math libraries. You
>> noted that these tests sometimes gave platform-dependent differences
>> in the last digit, so wondering if the test value directly reflects
>> PostgreSQL or just what we happen to return now.

The results in float8.sql.out were recomputed in Spark/JVM.

As you can see in the file (float8.out), the results other than atanh also differ between Spark/JVM and PostgreSQL.
For example, the results for acosh are:
-- PostgreSQL
1.31695789692482

-- Spark/JVM
1.3169578969248166
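
For reference, a minimal Scala sketch that reproduces the longer JVM value, assuming the test input is acosh(2) (which matches these digits) and the standard identity acosh(x) = ln(x + sqrt(x^2 - 1)); whether Spark evaluates exactly this form is an assumption here. PostgreSQL's default float8 output rounds to roughly 15 significant digits, which would account for the shorter string above:

scala> val x = 2.0
x: Double = 2.0

scala> math.log(x + math.sqrt(x * x - 1.0)) // acosh(2) via the standard identity; the last digit may vary by platform
res0: Double = 1.3169578969248166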

By the way, the PostgreSQL implementation of atanh just calls atanh from math.h.

Bests,
Takeshi

On Sat, Jul 27, 2019 at 10:35 AM bo zhaobo <[hidden email]> wrote:
Hi all,

Thanks for your concern. Yeah, it's worth also testing against a backend database. But note that this issue was hit in Spark SQL itself, as we only test with Spark and do not integrate other databases.

Best Regards,

ZhaoBo




Sean Owen <[hidden email]> wrote on Friday, July 26, 2019 at 5:46 PM:
Interesting. I don't think log(3) is special, it's just that some
differences in how it's implemented and floating-point values on
aarch64 vs x86, or in the JVM, manifest at some values like this. It's
still a little surprising! BTW Wolfram Alpha suggests that the correct
value is more like ...810969..., right between the two. java.lang.Math
doesn't guarantee strict IEEE floating-point behavior, but
java.lang.StrictMath is supposed to, at the potential cost of speed,
and it gives ...81096, in agreement with aarch64.
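
For concreteness, a minimal REPL sketch of the two doubles in question (the printed values are the ones quoted in this thread and may differ on other platforms):

scala> math.log(3.0) // java.lang.Math on x86, per this thread
res0: Double = 1.0986122886681098

scala> StrictMath.log(3.0) // strict IEEE semantics; agrees with the aarch64 result, per this thread
res1: Double = 1.0986122886681096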

@Yuming Wang the results in float8.sql are from PostgreSQL directly?
Interesting if it also returns the same less accurate result, which
might suggest it's more to do with underlying OS math libraries. You
noted that these tests sometimes gave platform-dependent differences
in the last digit, so wondering if the test value directly reflects
PostgreSQL or just what we happen to return now.

One option is to use StrictMath in special cases like computing atanh.
That gives a value that agrees with aarch64.
I also note that 0.5 * (math.log(1 + x) - math.log(1 - x)) gives the
more accurate answer too, and makes the result agree with, say,
Wolfram Alpha for atanh(0.5).
(Actually, if we do that, better still is 0.5 * (math.log1p(x) -
math.log1p(-x)) for best accuracy near 0.)
Commons Math also has implementations of sinh, cosh, atanh that we
could call. It claims it's possibly more accurate and faster. I
haven't tested its result here.

FWIW the "log1p" version appears, from some informal testing, to be
most accurate (in agreement with Wolfram) and using StrictMath doesn't
matter. If we change something, I'd use that version above.
The only issue is if this causes the result to disagree with
PostgreSQL, but then again it's more correct and maybe the DB is
wrong.
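
For concreteness, here is a minimal Scala sketch of the three formulations above; the printed digits follow from the x86 results quoted in this thread and may differ on other platforms:

scala> val x = 0.5
x: Double = 0.5

scala> 0.5 * math.log((1.0 + x) / (1.0 - x)) // the current formulation
res0: Double = 0.5493061443340549

scala> 0.5 * StrictMath.log((1.0 + x) / (1.0 - x)) // strict IEEE semantics; agrees with aarch64
res1: Double = 0.5493061443340548

scala> 0.5 * (math.log1p(x) - math.log1p(-x)) // best accuracy near 0; agrees with Wolfram Alpha
res2: Double = 0.5493061443340548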


The rest may be a test vs PostgreSQL issue; see
https://issues.apache.org/jira/browse/SPARK-28316


On Fri, Jul 26, 2019 at 2:32 AM Tianhua huang <[hidden email]> wrote:
>
> Hi, all
>
>
> Sorry to disturb you again; several SQL tests failed on the arm64 instance:
>
> pgSQL/float8.sql *** FAILED ***
> Expected "0.549306144334054[9]", but got "0.549306144334054[8]" Result did not match for query #56
> SELECT atanh(double('0.5')) (SQLQueryTestSuite.scala:362)
> pgSQL/numeric.sql *** FAILED ***
> Expected "2 2247902679199174[72 224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595840 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051856] 107511333880052007....", but got "2 2247902679199174[40224790267919917955.1326161858
> 4 7405685069595001 7405685069594999.0773399947
> 5 5068226527.321263 5068226527.3212726541
> 6 281839893606.99365 281839893606.9937234336
> 7 1716699575118595580 1716699575118597095.4233081991
> 8 167361463828.0749 167361463828.0749132007
> 9 107511333880051872] 107511333880052007...." Result did not match for query #496
> SELECT t1.id1, t1.result, t2.expected
> FROM num_result t1, num_exp_power_10_ln t2
> WHERE t1.id1 = t2.id
> AND t1.result != t2.expected (SQLQueryTestSuite.scala:362)
>
> The first test failed because the value of math.log(3.0) is different on aarch64:
>
> # on x86_64:
>
> scala> val a = 0.5
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
> res1: Double = 0.5493061443340549
>
> scala> math.log((1.0 + a) / (1.0 - a))
> res2: Double = 1.0986122886681098
>
> # on aarch64:
>
> scala> val a = 0.5
>
> a: Double = 0.5
>
> scala> a * math.log((1.0 + a) / (1.0 - a))
>
> res20: Double = 0.5493061443340548
>
> scala> math.log((1.0 + a) / (1.0 - a))
>
> res21: Double = 1.0986122886681096
>
> And I tried several other numbers like math.log(4.0) and math.log(5.0), and they are the same; I don't know why math.log(3.0) is so special, but the result is indeed different on aarch64. If you are interested, please try it.
>
> The second test failed because some values of pow(10, x) differ on aarch64. Following the SQL tests in Spark, I ran similar tests on aarch64 and x86_64; take '-83028485' as an example:
>
> # on x86_64:
> scala> import java.lang.Math._
> import java.lang.Math._
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res4: Int = 83028485
> scala> math.log(abs(a))
> res5: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res6: Double = 1.71669957511859584E18
>
> # on aarch64:
>
> scala> var a = -83028485
> a: Int = -83028485
> scala> abs(a)
> res38: Int = 83028485
>
> scala> math.log(abs(a))
>
> res39: Double = 18.234694299654787
> scala> pow(10, math.log(abs(a)))
> res40: Double = 1.71669957511859558E18
>
> I sent an email to jdk-dev hoping someone can help, and I also filed this in JIRA: https://issues.apache.org/jira/browse/SPARK-28519. If you are interested, you are welcome to join the discussion, thank you very much.
>
>
> On Thu, Jul 18, 2019 at 11:12 AM Tianhua huang <[hidden email]> wrote:
>>
>> Thanks for your reply.
>>
>> About the first problem: we didn't find any other reason in the logs, just a timeout waiting for the executors to come up. After increasing the timeout from 10000 ms to 30000 (or even 20000) ms in https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L764 and https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/SparkContextSuite.scala#L792, the tests passed and more than one executor came up. We're not sure whether this is related to the flavor of our aarch64 instance; the current flavor is 8C8G. Maybe we will try a bigger flavor later. If anyone has another suggestion, please contact me, thank you.
>>
>> About the second problem, I proposed a pull request to apache/spark: https://github.com/apache/spark/pull/25186. If you have time, would you please help review it? Thank you very much.
>>
>> On Wed, Jul 17, 2019 at 8:37 PM Sean Owen <[hidden email]> wrote:
>>>
>>> On Wed, Jul 17, 2019 at 6:28 AM Tianhua huang <[hidden email]> wrote:
>>> > Two failed with the reason 'Can't find 1 executors before 10000 milliseconds elapsed' (see below); after we increased the timeout the tests passed, so we wonder if we can increase the timeout. I also have another question about https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TestUtils.scala#L285: why is the comparison not >=? Per the comment on the function, it should be >=.
>>> >
>>>
>>> I think it's ">" because the driver is also an executor, but not 100%
>>> sure. In any event it passes in general.
>>> These errors typically mean "I didn't start successfully" for some
>>> other reason that may be in the logs.
>>>
>>> > The other two failed with the reason '2143289344 equaled 2143289344'. This is because on the aarch64 platform the value of floatToRawIntBits(0.0f/0.0f) is 2143289344, which equals floatToRawIntBits(Float.NaN). I sent an email to jdk-dev about this and opened topics in the Scala community: https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845 and https://github.com/scala/bug/issues/11632. I thought it was something about the JDK or Scala, but after discussion it appears to be platform-related, so it seems the following asserts are not appropriate: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala#L704-L705 and https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala#L732-L733
>>>
>>> These tests could special-case execution on ARM, like you'll see some
>>> tests handle big-endian architectures.
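
As a sketch of the platform behavior described above (the floatToIntBits and isNaN checks are suggested alternatives, not what the Spark tests currently assert; the first result shown is the aarch64 one, per this thread):

scala> import java.lang.Float.{floatToRawIntBits, floatToIntBits}
import java.lang.Float.{floatToRawIntBits, floatToIntBits}

scala> val nan = { val zero = 0.0f; zero / zero } // computed at run time, so the NaN bit pattern comes from the hardware
nan: Float = NaN

scala> floatToRawIntBits(nan) == floatToRawIntBits(Float.NaN) // raw NaN bit patterns are platform-dependent
res0: Boolean = true

scala> floatToIntBits(nan) == floatToIntBits(Float.NaN) // floatToIntBits collapses every NaN to 0x7fc00000
res1: Boolean = true

scala> nan.isNaN // the simplest platform-independent check
res2: Boolean = true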




--
---
Takeshi Yamamuro

Re: Ask for ARM CI for spark

Sean Owen-2
Great thanks - we can take this to JIRAs now.
I think it's worth changing the implementation of atanh if the test value just reflects what Spark does, and there's evidence it's a little bit inaccurate.
There's an equivalent formula which seems to have better accuracy.



Re: Ask for ARM CI for spark

Tianhua huang
[hidden email], thank you very much. I saw your comment in https://issues.apache.org/jira/browse/SPARK-28519; I will test with the modification to see whether other similar tests fail, and will address them together in one pull request.



Re: Ask for ARM CI for spark

bo zhaobo
Hi team,
I want to run the same tests on ARM that the existing (x86) CI does. Since building and testing the whole Spark project takes too long, I plan to split the work into multiple jobs to reduce the run time. But I cannot see what the existing CI[1] does (many private scripts are called), so could any CI maintainers tell us how to split the jobs and what the different CI jobs do? For example, PR titles contain [SQL], [INFRA], [ML], [DOC], [CORE], [PYTHON], [k8s], [DSTREAMS], [MLlib], [SCHEDULER], [SS], [YARN], [BUILD], etc., and each of them seems to run a different CI job.

@shane knapp,
Sorry to disturb you. Your email address looks like it is from 'berkeley.edu'; are you the right person to ask for help with this? ;-)
If so, could you give us some help or advice? Thank you.

Thank you very much,

Best Regards,

ZhaoBo




Re: Ask for ARM CI for spark

bo zhaobo

Hi Team,

Any updates about the CI details? ;-)

Also, I need your kind help with the Spark QA tests: could anyone tell us how those tests are triggered? When? How? So far, I haven't figured out how they work.

Thanks 

Best Regards,

ZhaoBo





Re: Ask for ARM CI for spark

shane knapp
i'm out of town, but will answer some of your questions next week.




--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead

Re: Ask for ARM CI for spark

bo zhaobo
Hi shane,
Thanks for your reply. I will wait until you are back. ;-)

Thanks,
Best regards
ZhaoBo




Re: Ask for ARM CI for spark

Tianhua huang
Hi all,

About the ARM tests for Spark: recently we found two tests that fail after the commit https://github.com/apache/spark/pull/23767:
       ReplayListenerSuite:
       - ...
       - End-to-end replay *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       - End-to-end replay with compression *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
        
We tried reverting the commit and the tests passed. The patch is large and, sorry, we haven't been able to find the reason so far. If you are interested, please try it; it would be much appreciated if someone could help us figure it out.


Re: Ask for ARM CI for spark

Tianhua huang
Hi all,

I want to discuss Spark ARM CI again. We ran some tests on an ARM instance based on master; the jobs include https://github.com/theopenlab/spark/pull/13 and the k8s integration https://github.com/theopenlab/spark/pull/17/. There are several things I want to talk about:

First, about the failed tests:
    1. We have fixed some problems, e.g. https://github.com/apache/spark/pull/25186 and https://github.com/apache/spark/pull/25279; thanks to Sean Owen and others for helping us.
    2. We tried the k8s integration test on ARM and met an error: apk fetch hangs. The tests passed after adding the '--network host' option to the `docker build` command, see https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176; the solution follows https://github.com/gliderlabs/docker-alpine/issues/307. I don't know whether this has ever happened in the community CI, or maybe we should submit a PR to pass '--network host' to `docker build`?
    3. We found two tests that fail after the commit https://github.com/apache/spark/pull/23767:
       ReplayListenerSuite:
       - ...
       - End-to-end replay *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       - End-to-end replay with compression *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       
        We tried reverting the commit and the tests passed. The patch is large and, sorry, we haven't found the reason yet; if you are interested, please try it, and it would be much appreciated if someone could help us figure it out.

Second, about the test time: we increased the flavor of the ARM instance to 16U16G, but there seems to have been no significant improvement. The k8s integration test took about an hour and a half, and the QA test (like the spark-master-test-maven-hadoop-2.7 community Jenkins job) took about seventeen hours (it is too long :( ). We suspect the reasons are instance performance and the network.
After splitting the jobs by project (sql, core, and so on), the time decreased to about seven hours, see https://github.com/theopenlab/spark/pull/19. Looking at the Spark QA tests, e.g. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/, those tests never seem to download jar packages from the Maven Central repo (such as https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar). So we want to know how the Jenkins jobs manage that: is there an internal Maven repo? Maybe we can do the same thing to avoid the network cost of downloading the dependency jars.

Third, and most important, about ARM CI for Spark: we believe it is necessary, right? You can see we have really made a lot of effort, and the basic ARM build/test jobs are now OK, so we suggest adding ARM jobs to the community CI. We can set them to non-voting at first and improve/enrich the jobs step by step. Generally, there are two ways in our mind to integrate ARM CI for Spark:
     1) We introduce OpenLab ARM CI into Spark as a custom CI system. We provide human resources and ARM test VMs, focus on ARM-related Spark issues, and push the PRs into the community.
     2) We donate ARM VM resources to the existing AMPLab Jenkins. We still provide human resources, focus on ARM-related Spark issues, and push the PRs into the community.
With both options we will provide human resources for maintenance, and of course it would be great if we can work together. So please tell us which option you would prefer, and let's move forward. Waiting for your reply, thank you very much.

On Wed, Aug 14, 2019 at 10:30 AM Tianhua huang <[hidden email]> wrote:
OK, thanks. 

On Tue, Aug 13, 2019 at 8:37 PM Sean Owen <[hidden email]> wrote:
-dev@ -- it's better not to send to the whole list to discuss specific changes or issues from here. You can reply on the pull request.
I don't know what the issue is either at a glance.


Re: Ask for ARM CI for spark

Sean Owen-2
I think the right goal is to fix the remaining issues first. If we set up CI/CD it will only tell us there are still some test failures. If it's stable, and not hard to add to the existing CI/CD, yes it could be done automatically later. You can continue to test on ARM independently for now.

It sounds indeed like there are some networking problems in the test system if you're not able to download from Maven Central. That rarely takes significant time, and there aren't project-specific mirrors here. You might be able to point at a closer public mirror, depending on where you are.


Re: Ask for ARM CI for spark

Tianhua huang
[hidden email], thanks for your reply.
I basically agree with you; there are two points I have to make :)
First, maybe I didn't express it clearly enough: we currently download from Maven Central in our test system, but the community Jenkins CI tests never seem to download jar packages from the Maven Central repo. Our question is whether there is an internal Maven repo in the community Jenkins.
Second, about the failed tests: of course we will continue to figure them out, and we hope someone can help/join us :) But I am afraid we cannot simply wait for things to be "stable" (maybe you mean no failed tests?). The failed ReplayListenerSuite tests mentioned in my last mail passed before; we suspect they were introduced by https://github.com/apache/spark/pull/23767 (we reverted the code and the tests passed), so we hope someone can help us look deeper into it. The tests we run are based on master, so if some modification introduces errors, the tests will fail; I think this is one reason we need ARM CI.

Thank you all :)


Re: Ask for ARM CI for spark

bo zhaobo
Hi Sean,

Thanks very much for pointing out the roadmap ;-). Then I think we will continue to focus on our test environment.

For the networking problems: I mean that we can access Maven Central, and our jobs can download the required jar packages at high network speed. What we want to know is why the Spark QA test job[1] logs show that the job script/Maven build does not seem to download any jar packages; could you tell us the reason for that? We raised the "networking problems" because of a phenomenon we noticed during testing: if we execute "mvn clean package" in a fresh test environment (in our test environment, the test VMs are destroyed after a job finishes), Maven downloads the dependency jar packages from Maven Central, but in the "spark-master-test-maven-hadoop" job [2], the log shows no jar downloads at all. What is the reason for that?
Also, when we build the Spark jars while downloading dependencies from Maven Central, it costs almost an hour, whereas [2] costs just 10 minutes. But if we run "mvn package" in a VM that has already executed "mvn package" before, it takes only 14 minutes, very close to [2]. So we suspect that downloading the jar packages is what costs so much time. For the goal of ARM CI, we expect the performance of the new ARM CI to be close to the existing x86 CI, so users can accept it more easily.


Best regards

ZhaoBo





Re: Ask for ARM CI for spark

Sean Owen-2
I'm not sure what you mean. The dependencies are downloaded by SBT and Maven like in any other project, and nothing about it is specific to Spark. 
The worker machines cache artifacts that are downloaded from these, but this is a function of Maven and SBT, not Spark. You may find that the initial download takes a long time.

On Thu, Aug 15, 2019 at 9:02 PM bo zhaobo <[hidden email]> wrote:
Hi Sean,

Thanks very much for pointing out the roadmap. ;-). Then I think we will continue to focus on our test environment.

For the networking problems, I mean that we can access Maven Central, and jobs cloud download the required jar package with a high network speed. What we want to know is that, why the Spark QA test jobs[1] log shows the job script/maven build seem don't download the jar packages? Could you tell us the reason about that? Thank you.  The reason we raise the "networking problems" is that we found a phenomenon during we test, if we execute "mvn clean package" in a new test environment(As in our test environment, we will destory the test VMs after the job is finish), maven will download the dependency jar packages from Maven Central, but in this job "spark-master-test-maven-hadoop" [2], from the log, we didn't found it download any jar packages, what the reason about that?  
Also we build the Spark jar with downloading dependencies from Maven Central, it will cost mostly 1 hour. And we found [2] just cost 10min. But if we run "mvn package" in a VM which already exec "mvn package" before, it just cost 14min, looks very closer with [2]. So we suspect that downloading the Jar packages cost so much time. For the goad of ARM CI, we expect the performance of NEW ARM CI could be closer with existing X86 CI, then users could accept it eaiser. 


Best regards

ZhaoBo




Mailtrack Sender notified by
Mailtrack 19/08/16 上午09:48:43

Sean Owen <[hidden email]> 于2019年8月15日周四 下午9:58写道:
I think the right goal is to fix the remaining issues first. If we set up CI/CD it will only tell us there are still some test failures. If it's stable, and not hard to add to the existing CI/CD, yes it could be done automatically later. You can continue to test on ARM independently for now.

It sounds indeed like there are some networking problems in the test system if you're not able to download from Maven Central. That rarely takes significant time, and there aren't project-specific mirrors here. You might be able to point at a closer public mirror, depending on where you are.

On Thu, Aug 15, 2019 at 5:43 AM Tianhua huang <[hidden email]> wrote:
Hi all,

I want to discuss spark ARM CI again, we took some tests on arm instance based on master and the job includes  https://github.com/theopenlab/spark/pull/13  and k8s integration https://github.com/theopenlab/spark/pull/17/ , there are several things I want to talk about:

First, about the failed tests:
    1.we have fixed some problems like https://github.com/apache/spark/pull/25186 and https://github.com/apache/spark/pull/25279, thanks sean owen and others to help us.
    2.we tried k8s integration test on arm, and met an error: apk fetch hangs,  the tests passed  after adding '--network host' option for command `docker build`, see:
        https://github.com/theopenlab/spark/pull/17/files#diff-5b731b14068240d63a93c393f6f9b1e8R176  , the solution refers to https://github.com/gliderlabs/docker-alpine/issues/307  and I don't know whether it happened once in community CI, or maybe we should submit a pr to pass  '--network host' when `docker build`?
    3.we found there are two tests failed after the commit  https://github.com/apache/spark/pull/23767  :
       ReplayListenerSuite:
       - ...
       - End-to-end replay *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       - End-to-end replay with compression *** FAILED ***
         "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
       
        we tried to revert the commit and then the tests passed, the patch is too big and so sorry we can't find the reason till now, if you are interesting please try it, and it will be very appreciate          if someone can help us to figure it out.

Second, about the test time, we increased the flavor of arm instance to 16U16G, but seems there was no significant improvement, the k8s integration test took about one and a half hours, and the QA test(like spark-master-test-maven-hadoop-2.7 community jenkins job) took about seventeen hours(it is too long :(), we suspect that the reason is the performance and network,
we split the jobs based on projects such as sql, core and so on, the time can be decrease to about seven hours, see https://github.com/theopenlab/spark/pull/19 We found the Spark QA tests like  https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/   , it looks all tests seem never download the jar packages from maven centry repo(such as https://repo.maven.apache.org/maven2/org/opencypher/okapi-api/0.4.2/okapi-api-0.4.2.jar). So we want to know how the jenkins jobs can do that, is there a internal maven repo launched? maybe we can do the same thing to avoid the network connection cost during downloading the dependent jar packages.

Third, and most important, ARM CI for Spark itself: we believe it is necessary, right? You can see we have made a lot of effort, and the basic ARM build/test jobs are OK now, so we suggest adding ARM jobs to the community. We can set them to non-voting first and improve/enrich the jobs step by step. Generally, there are two ways in our mind to integrate ARM CI for Spark:
     1) We introduce the OpenLab ARM CI into Spark as a custom CI system. We provide human resources and test ARM VMs, we will focus on ARM-related issues in Spark, and we will push the PRs into the community.
     2) We donate ARM VM resources to the existing amplab Jenkins. We still provide the human resources, focus on ARM-related issues in Spark, and push the PRs into the community.
With either option we will provide human resources for maintenance, and of course it would be great if we can work together. So please tell us which option you would prefer, and let's move forward. Waiting for your reply, thank you very much.

Re: Ask for ARM CI for spark

bo zhaobo
Hi Sean,
Thanks for the reply, and apologies for the confusion.
I know the dependencies are downloaded by SBT or Maven. But the Spark QA job also executes "mvn clean package", so why doesn't its log [1] print anything like "downloading some jar from Maven Central", and why does it build so fast? Is the reason that Spark Jenkins builds the Spark jars on physical machines and doesn't destroy the test environment after a job finishes? Then a later job building Spark would get the dependency jars from the local cache, because previous jobs that executed "mvn package" had already downloaded those dependencies onto the worker machine. Am I right? Is that the reason the job log [1] doesn't print any download information from Maven Central?

Thank you very much.



Best regards

ZhaoBo


On Fri, Aug 16, 2019 at 10:38 AM Sean Owen <[hidden email]> wrote:
I'm not sure what you mean. The dependencies are downloaded by SBT and Maven like in any other project, and nothing about it is specific to Spark. 
The worker machines cache artifacts that are downloaded from these, but this is a function of Maven and SBT, not Spark. You may find that the initial download takes a long time.
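
For context, these caches are just the standard Maven and SBT/Ivy locations, nothing Spark-specific; if they are already populated on a worker, a build prints no download lines:

# Default local caches consulted before any network download:
ls ~/.m2/repository   # Maven local repository
ls ~/.ivy2/cache      # Ivy cache used by SBT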
