SeanTAllen changed the topic of #wallaroo to: Welcome! Please check out our Code of Conduct -> https://github.com/WallarooLabs/wallaroo/blob/master/CODE_OF_CONDUCT.md | Public IRC Logs are available at -> https://irclog.whitequark.org/wallaroo
aturley has joined #wallaroo
aturley has quit [Client Quit]
aturley has joined #wallaroo
mynameisvinn has joined #wallaroo
mynameisvinn has quit [Client Quit]
nemosupremo has joined #wallaroo
<nemosupremo> I'm exploring Wallaroo, and I'm wondering if there are any examples of using Wallaroo with an external database for state. I currently have an analytics pipeline in Spark+Cassandra and I'm thinking of exploring Wallaroo for performance reasons. I'd like to persist data to Cassandra (for example, minutely metrics that shouldn't persist in Wallaroo forever)
<nemosupremo> I also need to load some external state into Wallaroo for processing (e.g., customer configuration data)
<nemosupremo> I've only been researching for a weekend, so I may have missed some basic concepts
<SeanTAllen> nemosupremo: we currently do not have cassandra sources or sinks, although both would be interesting. right now, you'd need to write sources/sinks in Pony. We are about to start work on an API to allow sources and sinks to be written in Python and Go.
<SeanTAllen> at the moment, you could load data into kafka and use the kafka source to read it in, then output again to kafka using the kafka sink. or you could do the same with the TCPSource and TCPSink.
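A rough sketch of the TCP wiring described above, based on the Wallaroo Python example apps of this period; the decorator parameters and `application_setup` conventions are assumptions taken from those examples, and the Kafka configs mentioned later in the log would swap in for the TCP ones:

```python
import wallaroo

# Framed TCP decoder/encoder, following the example apps of this era.
@wallaroo.decoder(header_length=4, length_fmt=">I")
def decoder(bs):
    return bs.decode("utf-8")

@wallaroo.encoder
def encoder(data):
    return (data + "\n").encode("utf-8")

@wallaroo.computation(name="pass through")
def pass_through(data):
    return data

def application_setup(args):
    in_host, in_port = wallaroo.tcp_parse_input_addrs(args)[0]
    out_host, out_port = wallaroo.tcp_parse_output_addrs(args)[0]

    ab = wallaroo.ApplicationBuilder("metrics app")
    ab.new_pipeline("events",
                    wallaroo.TCPSourceConfig(in_host, in_port, decoder))
    ab.to(pass_through)
    ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encoder))
    return ab.build()
```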
<SeanTAllen> what sort of external state are you looking to load in? how is it used?
<jonbrwn> I wrote a post about the high-level design decisions behind the metrics capturing system used by Wallaroo to maintain a low-overhead and how some of those decisions can apply to other high-performance systems: https://blog.wallaroolabs.com/2018/02/building-low-overhead-metrics-collection-for-high-performance-systems/
<jonbrwn> If you care to get more of a lower-level breakdown of the design choices, Nisan wrote a great post on that: https://blog.wallaroolabs.com/2018/02/latency-histograms-and-percentile-distributions-in-wallaroo-performance-metrics/
<nemosupremo> yeah that's the conclusion I came to, but it seemed too hard to be true - are there no database sinks currently? Is persisting data to a database not a current goal?
<nemosupremo> @SeanTAllen: concerning external state, it's mostly client info. For example, let's say we get information about Device A. Somehow the pipeline has to know that Device A should be rolled into Customer B's dataset
<SeanTAllen> nemosupremo: we plan on adding database sinks after we do the API for Python/Go sources/sinks. TCP and Kafka have allowed the folks we've worked with so far to accomplish their goals. You could write out to cassandra from a step; it's not perfect, but it's a workaround for now. However, that probably isn't going to address your underlying need.
<nemosupremo> If I can output to Kafka, that's not too bad - I once had to do something similar for Elasticsearch to make sure our ES cluster wasn't overloaded
<SeanTAllen> A wider variety of integrations is definitely something we plan on doing. We are a small team and there's lots to do. Allowing folks to write those integrations themselves in the language of their choice is our first step towards opening that up. In general, right now, we are prioritizing the needs of clients we are working with.
<SeanTAllen> nemosupremo: i have a lot of experience from my last job with overloading ES clusters.
<SeanTAllen> hmmm... regarding external state. I'd need to understand your design a little more. Right now, the idea with Wallaroo is that you would stream your external state in via another pipeline that can interact with any state partition whose computations would need access to the state.
<nemosupremo> For me, its more about getting the semantics / correctness right, which is what I'm researching. For example, if I'm computing minutely metrics I need to somehow (AFAIK) expire data from Wallaroo (so I don't hold the whole history in RAM), and then make sure I'm flushing the state correctly.
<SeanTAllen> for example, we built a POC with a large bank that was doing risk analysis on client accounts, sometimes the parameters for given clients would change, those were streamed in as updates that would then change a field in the Client state object.
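A minimal sketch of that pattern: a second pipeline delivers parameter updates keyed to the same state partition, and the state computation simply mutates a field. The message shape and names are hypothetical, and the `(output, state_modified)` return convention is an assumption drawn from the Python API examples of this era (the pipeline itself would be wired up with something like `to_state_partition`):

```python
import wallaroo

class ClientState(object):
    def __init__(self):
        self.risk_params = {}

@wallaroo.state_computation(name="update client params")
def update_params(update, state):
    # `update` is a hypothetical message with `name` and `value` fields,
    # arriving on a separate pipeline partitioned by client id.
    state.risk_params[update.name] = update.value
    return (None, True)  # no downstream output; True marks state as changed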
<SeanTAllen> nemosupremo: you have complete control over your state objects and can design your own expiration of data. that said... we are going to be working on allowing state objects to be "retired" if they haven't received any activity after a certain amount of time. basically a TTL.
<SeanTAllen> re: flushing, we don't yet have the concept of a "tick" that would let you take an action based on a unit of time (like exporting metrics every minute). that is planned, however. currently, you would check when updating your state to see if it has been at least a minute since the last time you exported this particular metric and, if so, create a new output.
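A sketch of that workaround, under the same state-computation assumptions as above: the state object remembers when the current window started, and the first event to arrive after the minute boundary flushes the window downstream:

```python
import time
import wallaroo

class MinuteCounter(object):
    def __init__(self):
        self.window_start = time.time()
        self.count = 0

@wallaroo.state_computation(name="minutely counts")
def count_event(event, state):
    state.count += 1
    now = time.time()
    # No tick support yet, so check on every update whether a minute
    # has passed, and emit the finished window as output if it has.
    if now - state.window_start >= 60.0:
        out = (state.window_start, state.count)
        state.window_start = now
        state.count = 0
        return (out, True)
    return (None, True)
```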
<nemosupremo> That sounds about right wrt the large bank PoC. The current idea I'm wrestling with is that our current spark pipeline reads data from cassandra, performs an update and saves. The "Read from Cassandra" concept is hard to translate.
<SeanTAllen> our early use cases were in the financial space, where the real-time systems would run for a trading day and then be shut down and restarted with fresh state the next day (fairly common in that space).
<SeanTAllen> so there are a number of features we are working on adding for other use cases.
<nemosupremo> Once the data is inside wallaroo, then I can just use Wallaroo state and not worry about it
<SeanTAllen> Ya.
<SeanTAllen> So cassandra -> kafka -> wallaroo -> kafka would be your best bet for now.
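The cassandra -> kafka leg could be a small standalone bridge script along these lines; the keyspace, table, topic, and addresses are placeholders, and it uses the stock cassandra-driver and kafka-python packages:

```python
import json

from cassandra.cluster import Cluster   # pip install cassandra-driver
from kafka import KafkaProducer         # pip install kafka-python

cluster = Cluster(["cassandra-host"])
session = cluster.connect("analytics")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Publish each stored device row to the topic a Wallaroo Kafka source reads.
for row in session.execute("SELECT device_id, customer_id FROM devices"):
    msg = {"device_id": row.device_id, "customer_id": row.customer_id}
    producer.send("device-state", json.dumps(msg).encode("utf-8"))

producer.flush()
```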
<nemosupremo> but when the application starts up, or when it sees a new device, it needs to read the historical data for that device
<SeanTAllen> that's some possible additional infrastructure so it might not be the right time for you, depending on your needs.
<SeanTAllen> how do you read in the historical data?
<SeanTAllen> how does it see a new device?
<nemosupremo> the last ~5 minutes. I don't think it would be a problem if I was starting from scratch today
<nemosupremo> or not the last 5 minutes, it reads some stored state about the device
<SeanTAllen> right now, the folks we are working with are starting from scratch
<SeanTAllen> sorry, so what i meant was, on startup, how do you know what historical data needs to be read?
<SeanTAllen> when you see a new device, how does it know there is historical data? it checks cassandra?
<nemosupremo> wouldn't it be bad if I did a query in the Wallaroo state update function?
<nemosupremo> Yes, it checks Cassandra
<nemosupremo> Since the Spark code doesn't persist any local data
<SeanTAllen> querying cassandra from inside the state update is not optimal from a performance standpoint, but you can do it.
<nemosupremo> I would try to add some checks to prevent from always querying cassandra, like if I already have the data in Wallaroo, no need to check cassandra
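That check might look like the sketch below: query Cassandra only the first time a partition sees a device, then rely on Wallaroo state. Table and column names are placeholders, and the state-computation conventions are the same assumptions as in the earlier sketches:

```python
import wallaroo
from cassandra.cluster import Cluster

# One session per process, shared by all state computations.
_session = Cluster(["cassandra-host"]).connect("analytics")

class DeviceState(object):
    def __init__(self):
        self.loaded = False
        self.customer_id = None

@wallaroo.state_computation(name="update device")
def update_device(event, state):
    if not state.loaded:
        # First sighting of this device: pull its stored info once.
        row = _session.execute(
            "SELECT customer_id FROM devices WHERE device_id = %s",
            (event.device_id,)).one()
        state.customer_id = row.customer_id if row else None
        state.loaded = True
    # ... apply `event` to this device's metrics here ...
    return (None, True)
```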
<SeanTAllen> one more question: what does "see a new device" mean? basically, "i don't have a record in memory for this device"?
<nemosupremo> Right
<nemosupremo> It could be "the pipeline just started so every device is new"
<nemosupremo> or this device was just installed, so I need to pull some information for this device
<nemosupremo> I think my issue is, Cassandra is the source of truth for us and other services depend on Cassandra having up to date data
<nemosupremo> If I was starting fresh today, then Wallaroo could have all the device state that I write out to Cassandra
<SeanTAllen> Makes sense. You could still load all the data in from the source of truth as part of a startup and avoid the "just in time" lookup cost, but depending on constraints, that may or may not be the correct thing to do.
<SeanTAllen> You mentioned performance problems with Spark. What are you running into?
<nemosupremo> 1. Spark needs a LOT of RAM, to the point where it doesn't make sense. This inflates our server costs (which is why it's a performance issue, we need to use less RAM). I've written homegrown analytics frameworks before (in Go/C++), but the team here only knows Python (PySpark).
<nemosupremo> 2. We pay a huge overhead when we process 40MB of data a minute. It currently takes us ~30 seconds to process this data and we need to be under a minute
<nemosupremo> I could probably investigate (2) some more, but I don't really care about that, I really want to see if I can solve (1)
<SeanTAllen> makes sense
<nemosupremo> I've written a small PoC in Go, so I know it can be done for cheaper, I just want to explore an option that lets our DS team continue to iterate in Python
<SeanTAllen> one of the goals with wallaroo has been to keep the memory overhead of the framework low, thereby allowing more room for user data
<SeanTAllen> same with latency of individual features
<SeanTAllen> well, your case of letting the DS team work in Python is one of the use cases we want to support.
<SeanTAllen> at my previous job DS worked in Python and then had to translate it over to a JVM solution for production. There were numerous problems with that.
<nemosupremo> Question about "autoscaling" - how does that work? Are you assuming the operator has some autoscaling group in AWS (or similar) that Wallaroo can just "take advantage of"?
<nemosupremo> or is Wallaroo a shared cluster, so you just have 100 machines running Wallaroo that you can submit jobs to, and those individual jobs can allocate as many resources as they want?
<SeanTAllen> autoscaling allows you to change the number of processes that make up the application while running. so for example, i can add an additional process or two and wallaroo will reconfigure where state is stored to take advantage of the additional capacity.
<SeanTAllen> at the moment, it is not hooked in to any external resource managers.
<SeanTAllen> when you start up a new wallaroo process, you can give it the address of an existing member of a cluster and it will join that cluster.
<SeanTAllen> if you are using a resource manager like Yarn or Mesos then you can have another Wallaroo process started up within the managed resources.
<SeanTAllen> or you could start up additional resources some place like AWS and use your normal deployment means to install the wallaroo app and fire it up.
<SeanTAllen> There's a ton of automation tools out there, from chef and puppet et al to Mesos or Kubernetes etc.
<nemosupremo> We use mesos, so I'm thinking about the best way to integrate. Thanks for answering my questions, I hope I didn't distract you too much
<SeanTAllen> Our initial goal was to be agnostic so that folks can fit Wallaroo into whatever their company uses.
<SeanTAllen> We've started looking into an "opinionated" solution for folks who don't already have a solution in place.
<SeanTAllen> nemosupremo: we love talking to folks, learning about what their concerns are and answering their questions. Do not worry about distraction.
<SeanTAllen> question, are you also nemosupremo on GitHub?
<nemosupremo> yep
<SeanTAllen> i love your profile photo
<nemosupremo> haha its great
<SeanTAllen> well, if you have more questions, we are here to answer them!
<nemosupremo> A question about your Python bindings. I see Python 3 is in the pipeline - what are the roadblocks in working with Python 3? Are there serialization issues when working with Python 3 or something?
<SeanTAllen> Given how we work with Python, there shouldn't be much to do to get it working. We embed the Python interpreter in a Wallaroo runner application, and very little of the API changed from 2 to 3. The largest amount of work would be testing and ongoing maintenance.
nemosupremo has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
nemosupremo has joined #wallaroo
<SeanTAllen> nemosupremo: would you be interested in talking to a colleague of mine more about your use case? we love to hear what sort of problems folks are having. we keep these calls to 15 to 30 minutes and offer an awesome Wallaroo Labs shirt and thanks in return.
<nemosupremo> Yeah, I don't mind
<SeanTAllen> awesome! thank you.
<SeanTAllen> if you email me your address at sean@wallaroolabs.com, i'll send an intro email
<nemosupremo> yep done
<SeanTAllen> intro email sent. thanks.
nemosupremo has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
nemosupremo has joined #wallaroo
aturley has quit [Read error: Connection reset by peer]
aturley has joined #wallaroo
<nemosupremo> Just wrote a Wallaroo application, and trying to run it in the docker container just leaves me with "Segmentation fault"
<nemosupremo> I'm guessing it's #960
<nemosupremo> It wasn't. I was hardcoding the kafka broker address into my source code, like so - `wallaroo.KafkaSourceConfig(..., "kafka:9092", ...)`
<nemosupremo> it needed to be `wallaroo.KafkaSourceConfig(..., "kafka:9092", ...)`
<nemosupremo> This might still be #960?
<nemosupremo> import errors (due to packages not being installed) were reported fine
<nemosupremo> but this one just dumped a seg fault
<nemosupremo> I thought crashes were impossible in Pony (lol)
<slfritchie> nemosupremo: They're not impossible when the FFI is used.
aturley has quit [Read error: Connection reset by peer]
aturley has joined #wallaroo
<nemosupremo> got another segfault - is there a way to enable core dumps in docker?