[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0000). Please do the needful. [00:00:14] (03PS1) 10Filippo Giunchedi: puppetmaster: fail on private post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/330824 [00:02:00] PROBLEM - puppet last run on restbase-test1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/cassandra-instances.d] [00:02:49] right, so those are old hostnames which shouldn't be in icinga anyways, they don't even resolve in dns [00:03:30] RECOVERY - Check systemd state on restbase-dev1001 is OK: OK - running: The system is fully operational [00:05:00] RECOVERY - puppet last run on restbase-test1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [00:11:43] !log carbon - stopping ganglia-monitor-aggregator for good [00:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:50] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:12:50] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:13:00] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [00:13:28] !log analytics1036, ms-fe1003 - ran puppet to fix Icinga [00:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:00] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:15:40] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [00:15:40] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [00:23:49] (03PS7) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [00:25:29] (03PS8) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [00:34:11] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [00:37:12] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:43:20] (03PS2) 10Filippo Giunchedi: Pass the filtered request headers to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330646 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:43:36] (03PS2) 10Filippo Giunchedi: Add PoolCounter configuration to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330647 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:44:10] (03PS1) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [00:46:36] @seen jouncebot [00:46:36] mutante: jouncebot is in here, right now [00:46:48] jouncebot: next [00:46:48] In 0 hour(s) and 13 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0100) [00:47:11] (03CR) 10Filippo Giunchedi: [C: 032] Add PoolCounter configuration to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330647 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:47:20] out for 13 minutes :) [00:47:42] (03CR) 10Filippo Giunchedi: [C: 032] Pass the filtered request headers to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330646 (https://phabricator.wikimedia.org/T151066) (owner: 10Gilles) [00:59:34] Is Gerrit really down right now? [00:59:51] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.81 and port 29418: Connection refused [00:59:51] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:00:01] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] [01:00:04] ostriches and mutante: Respected human, time to deploy Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170106T0100). Please do the needful. [01:00:21] divadsn: yes scheduled maintenance [01:00:31] though a !log would have been nice [01:00:40] !log [01:00:44] Eh, forgot to !log sorry [01:00:51] !log gerrit: down for upgrade [01:00:51] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.13.4-13-gc0c5cc4742 (SSHD-CORE-1.2.0) (protocol 2.0) [01:00:51] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:06] !log gerrit: back up from upgrade [01:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:24] nice, thanks ostriches [01:01:31] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [01:01:52] now that gerrit is back, using gerrit to merge gerrit config change :) [01:02:23] Yep, gonna do our swap to logstash during the window :) [01:02:25] (03PS26) 10Dzahn: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [01:02:27] godog, ah okay, thanks for info :) [01:03:21] (03CR) 10Dzahn: [C: 032] Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/326177 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [01:04:28] ostriches: any other small change that was ok and waiting? [01:05:01] Keep getting ':502 Bad Gateway' from RCStream? 
[01:05:04] _not_ the one changing database collation in connector [01:05:06] Did something happen recently? [01:05:06] heh [01:05:10] http://codepen.io/Krinkle/pen/laucI/?editors=0010 [01:05:14] Fails 9/10 times when I refresh [01:05:26] 502 from ngin [01:05:27] x [01:05:38] https://stream.wikimedia.org/socket.io/1/?t=1483664718061 [01:05:55] mutante: That was the only one I had planned for this window [01:06:03] it's supposed to return something like "450610263156:60:60:websocket,xhr-multipart,htmlfile,jsonp-polling,flashsocket,xhr-polling" but most of the time doesn't [01:06:06] ostriches: alright... .and applied. [01:06:31] Ok, restarting service to pick it up [01:06:47] !log stream.wikimedia.org problems - nginx responds with HTTP 502 Bad Gateway to most requests [01:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:26] Hmm, service awfully slow.... [01:08:42] stat1003 tried to git clone something from gerrit right during the time [01:10:01] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:10:08] mutante: Constant timeouts from gerrit, let's revert logstash for now. 
[01:10:21] ugh, ok [01:11:10] well, i think we need to live hack [01:11:16] Already on it [01:11:16] so we can use gerrit to revert [01:11:21] ok [01:11:27] Disabled puppet, reverted, restarting [01:11:52] Yep, and service is back + snappy again [01:11:56] So yeah, that was causing it [01:12:15] (03PS1) 10Dzahn: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330830 [01:12:19] (03PS1) 10Chad: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330831 [01:12:24] Hah [01:12:30] (03Abandoned) 10Chad: Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330831 (owner: 10Chad) [01:12:45] (03CR) 10Dzahn: [C: 032] "unfortunately this caused gerrit to timeout, it came back when reverting this change" [puppet] - 10https://gerrit.wikimedia.org/r/330830 (owner: 10Dzahn) [01:12:47] (03CR) 10Chad: [C: 031] Revert "Gerrit: Enable logstash in gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/330830 (owner: 10Dzahn) [01:13:39] (03PS1) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [01:13:46] hehee [01:13:48] merged on master, you can enable again [01:14:09] paladox: sup [01:14:09] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [01:14:15] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [01:14:21] that stuff is because puppet wants to git clone [01:14:29] the other thing is a netsplit [01:14:40] mutante hi, i think ostriches hit the timeout because it probably could not hit logstash in prod [01:14:50] logstash1002.eqiad.wmnet [01:14:57] paladox: yea, first guess will be missing ferm rule? [01:15:05] yep [01:15:20] Ok, puppet back on cobalt [01:15:23] we can always add support for logstash but not enable it by default [01:16:03] let's check the iptables rules on the destination [01:16:04] (03PS2) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [01:16:21] make another change to fix that there if it is indeed the missing hole [01:16:29] mutante that ^^ should at least add support but have it disabled now :) [01:16:34] and ok [01:16:35] Ok, well bummer that didn't work [01:16:46] support but not enable sounds ok'ish to me [01:16:56] yep [01:18:08] paladox: ostriches: i guess puppet/modules/role/manifests/logstash/collector.pp [01:18:18] with all the existing ferm::service in there [01:18:28] collector.pp: ferm::service { 'logstash_log4j': [01:18:50] but we looked at that and saw srange is $DOMAIN_NETWORKS [01:18:58] yep [01:19:05] so... that should have worked [01:19:35] https://github.com/wikimedia/operations-puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/base/templates/firewall/defs.erb#L2 [01:19:39] mutante ^^ [01:19:50] you should be able to test with telnet right if it's a tcp connection? [01:19:52] did the log_port get set correct? [01:20:04] i see there's a variable there in that template [01:20:10] Terrible time for a netsplit [01:20:14] lol, yea [01:20:42] mutante i think so [01:20:44] logstash1002.eqiad.wmnet [01:20:56] mutante try telnet for logstash1002.eqiad.wmnet please?
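[Editor's note] The collector.pp rule being discussed would look roughly like this. This is a sketch from the fragments quoted in the log (the `'logstash_log4j'` name, the srange, and the port probed just below), not a verbatim copy of modules/role/manifests/logstash/collector.pp:

```puppet
# Sketch of the firewall hole under discussion (assumed shape, not the
# actual file). $DOMAIN_NETWORKS is a ferm macro covering all production
# networks, which is why cobalt should already have been allowed through
# and why the "missing ferm rule" theory was ruled out.
ferm::service { 'logstash_log4j':
    proto  => 'tcp',
    port   => '4560',
    srange => '$DOMAIN_NETWORKS',
}
```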
[01:21:15] i am, port 4560 [01:21:21] thanks [01:21:23] i connect to something, which then disconnects me again [01:21:23] yep [01:21:28] so not firewall [01:21:33] oh [01:22:09] Freaking netsplit, terrible timing [01:23:28] mutante: logstash on logstash1002 looks busted :/ [01:23:29] Ok, caught up IRC client? [01:23:31] Yes? [01:23:32] Good. [01:23:42] ostriches: ohi [01:23:46] mutante: I sent you an e-mail [01:23:57] I'm closing the window, I don't wanna play monkey patch after 5pm [01:24:01] We got lucky the actual upgrade was fast [01:24:10] Let's not jinx it with the logstash thing [01:24:16] That was never *urgent* [01:24:31] is the syntax right in this erb ? [01:24:32] https://gerrit.wikimedia.org/r/#/c/326177/26/modules/gerrit/templates/log4j.properties.erb [01:24:51] ostriches: fair, yes :) [01:24:54] mutante: Yes, it was correct, log4j file was fine [01:25:00] mutante yep, tested locally [01:25:10] without setting the log_host it will not set any tcp [01:25:18] with log_host set it will set the tcp [01:25:28] tested with the local puppet master i have [01:25:58] alright, was wondering for a second if i should have seen the literal values in the diff: [01:26:01] -log4j.appender.tcp.Port=<%= @log_port %> [01:26:04] -log4j.appender.tcp.RemoteHost=<%= @log_host %> [01:26:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [01:26:20] mutante: Um....no? 
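[Editor's note] The manual telnet check above can be scripted with bash's built-in `/dev/tcp` device, so no telnet client is needed. Host and port are the ones from the conversation; note this only distinguishes open from closed/filtered, it cannot see the accept-then-disconnect behaviour observed here (which points at an unhealthy daemon rather than a firewall):

```shell
#!/bin/bash
# Probe TCP reachability the way the telnet test above does:
# a firewall drop shows up as a timeout, a dead daemon as a refusal.
probe() {
    local host=$1 port=$2
    if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "open: ${host}:${port}"
    else
        echo "closed or filtered: ${host}:${port}"
    fi
}

probe 127.0.0.1 4560   # in prod: probe logstash1002.eqiad.wmnet 4560
```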
[01:26:21] I dunno [01:26:27] It looked fine on disk earlier :D [01:26:37] eg: log4j.appender.tcp.RemoteHost=logstash1002.eqiad.wmnet [01:26:47] ok, good [01:27:00] well, then, not jinxing it [01:27:05] but weird [01:27:14] !log Restarted logstash on logstash1002 (T154732) [01:27:15] tests were done with a puppetmaster [01:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:19] T154732: Exception in thread "Ruby-0-Thread-18: /opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.20/lib/stud/buffer.rb:92" java.lang.UnsupportedOperationException - https://phabricator.wikimedia.org/T154732 [01:27:33] that ^ may have had something to do with it [01:27:41] aah [01:27:59] looked like logstash soft died a day ago on that node :/ [01:28:00] because this is a new plugin, right [01:28:01] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [01:28:08] oh, a day ago, interesting [01:28:21] we've had the java input thing for a while [01:28:55] I need to look back at sal but I think that's the same error that took logstash1001 down over the holidays [01:29:08] bd808: First time using it, but we enabled the actual plugin a month or so ago ya [01:29:31] well, i'd be up for one more try, if you want to [01:29:31] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [01:29:44] yeah, same error as T154388 [01:29:45] T154388: No HHVM logs on kibana since 1 Jan 2017 0:00 - https://phabricator.wikimedia.org/T154388 [01:31:39] runs puppet on hosts where puppet tried to git clone [01:31:50] when service status tells you "Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable." where can you find the logs? [01:32:39] what about the classic /var/log/logstash/ there?
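[Editor's note] For reference, the rendered log4j.properties fragment the ERB template should produce once log_host/log_port are set would look roughly like this. The RemoteHost value is quoted verbatim in the log; the port matches the one probed earlier; the appender class line is an assumption (the Port/RemoteHost property names match the standard log4j SocketAppender):

```properties
# Sketch of the rendered template output, not the full file.
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.Port=4560
log4j.appender.tcp.RemoteHost=logstash1002.eqiad.wmnet
```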
[01:32:57] i see some gzipped ones [01:33:01] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [01:33:12] who watches the watcher? [01:33:13] :) [01:34:53] I was just trying to see if logstash1003 had croaked from the same thing [01:35:12] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [01:35:12] and it seemed to show up in journalctl rather than the on disk logs [01:35:42] I see recent timestamps in /var/log/logstash on 1003 though so it's probably ok [01:36:00] the logstash cluster really needs an owner :/ [01:36:01] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [01:36:31] bd808: /run/log/journal/ [01:37:11] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [01:37:22] mutante: heh. those are some fancy filenames [01:37:32] "config file is located at /etc/systemd/journald.conf and it places journal files at /var/log/journal/[machine-id]/*.journal if it exists, otherwise it places them in /run/log/journal/[machine-id]/*.journal" [01:38:03] bd808 the error you described sounds very similar to https://github.com/elastic/logstash/issues/3811 [01:38:03] they are also binary noise [01:39:28] bd808: journalctl --file= ? [01:39:36] hmm [01:40:02] Takes a file glob as an argument. 
If specified, journalctl will operate on the specified journal files matching GLOB instead of the default runtime and system journal paths [01:41:27] yea, that seems to work [01:41:38] journalctl --file=system@14b8e69a54b4470686e86b9017165c36-00000000001bf1ed-000544dc20f23834.journal [01:41:50] gets the Dec 30 stuff [01:43:11] https://github.com/elastic/logstash/issues/4054#issuecomment-149382295 [01:48:11] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1800.731944 Seconds [01:48:21] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1806.479707 Seconds [01:48:21] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 1806.491425 Seconds [01:48:26] (03PS1) 10Zhuyifei1999: Unassign 'transcode-reset' from Commons autoconfirmed and sysop groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330835 (https://phabricator.wikimedia.org/T154733) [01:49:11] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 25.720363 Seconds [01:49:21] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 31.5998 Seconds [01:49:21] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 31.598596 Seconds [01:50:52] (03CR) 10Krinkle: Add DB "shard" column to logstash log entries for labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 (owner: 10Aaron Schulz) [01:53:23] Krinkle: i dont know what this means but maybe a change in misc varnish "upstream prematurely closed connection while reading response header from upstream" [01:54:00] a ticket would be good [01:54:14] sorry, earlier we were in the middle of gerrit upgrade and also netsplit [01:56:09] also i think analytics-ops working on replacing it [02:14:37] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#2922293 (10Dzahn) I moved the eqiad Ganglia aggregator from carbon to install1001 today. 
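[Editor's note] The rotated-journal detective work above condenses to a short sketch. The directory logic follows the journald.conf behaviour quoted earlier (persistent storage under /var/log/journal wins over the runtime tmpfs location); the .journal filename is the example from the log and will differ per host:

```shell
#!/bin/bash
# Locate where journald keeps its (rotated) journal files, mirroring the
# journald.conf rule quoted above: /var/log/journal/[machine-id]/ if that
# directory exists, otherwise /run/log/journal/[machine-id]/.
journal_dir=/run/log/journal
[ -d /var/log/journal ] && journal_dir=/var/log/journal
echo "journal dir: ${journal_dir}"

# Then replay a specific rotated file (example filename from the log):
# journalctl --file="${journal_dir}"/*/system@14b8e69a54b4470686e86b9017165c36-00000000001bf1ed-000544dc20f23834.journal
```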
This part is now unblocked :) [02:14:41] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:15:41] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [02:26:11] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [02:33:30] (03PS2) 10Andrew Bogott: Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 [02:33:32] (03PS1) 10Andrew Bogott: Keystone: Add more uwsgi api processes [puppet] - 10https://gerrit.wikimedia.org/r/330837 [02:33:44] (03PS1) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [02:34:43] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.7) (duration: 13m 23s) [02:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:05] (03CR) 10Andrew Bogott: [C: 032] Keystone: Add more uwsgi api processes [puppet] - 10https://gerrit.wikimedia.org/r/330837 (owner: 10Andrew Bogott) [02:53:55] _joe_: Ping? [02:54:08] elukey: You too, I guess. 
:) [02:54:25] https://commons.wikimedia.org/wiki/Commons:Village_pump#Temporary_change_of_user_group_rights [02:55:11] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [03:03:00] (03CR) 10Dzahn: [C: 032] icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [03:03:07] (03PS9) 10Dzahn: icinga: use Letsencrypt for SSL cert, spend less donor money on prime numbers [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) [03:03:34] Revent: it's a bad time, 4 AM in Europe [03:04:24] but they'll see it later i'm sure, yea [03:04:29] mutante: Yeah, it was just an FYI for the, [03:04:32] *them [03:04:35] yep [03:05:26] !log OS installation on elastic2025-elastic2036 [03:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:41] Revent: sounds good how admins can add themselves if needed [03:06:30] Yeah, the goal really is to prevent anyone who is ‘unaware’ from poking at the button. [03:07:48] My ‘draft’ of that message (I got feedback first) rather dumbly pointed out that doing so was effectively a way to execute an untrackable DOS attack. :/ [03:10:49] heh, well, it sounds good to me now.
thanks for doing that [03:22:33] volans: godog: https://gerrit.wikimedia.org/r/#/c/327686/ :) [03:23:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 784.00 seconds [03:27:16] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2922386 (10Papaul) [03:30:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.11 seconds [03:33:34] (03PS1) 10Dzahn: icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 [03:37:22] (03PS2) 10Dzahn: icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 [03:38:26] (03CR) 10Dzahn: [C: 032] icinga: unbreak dependency cycle with Apache site and cert [puppet] - 10https://gerrit.wikimedia.org/r/330839 (owner: 10Dzahn) [03:42:11] !log icinga - debugging issue with cert change [03:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:01] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Exec[acme-setup-acme-icinga] [03:56:39] (03PS1) 10Dzahn: icinga: Include challenge-apache.conf, exclude acme from proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/330841 [03:57:01] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [03:59:54] (03CR) 10Dzahn: [C: 032] icinga: Include challenge-apache.conf, exclude acme from proto redirect [puppet] - 10https://gerrit.wikimedia.org/r/330841 (owner: 10Dzahn) [04:10:20] !log Icinga now using Letsencrypt cert and all good [04:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2922451 (10demon) 05Open>03Resolved a:03demon This rolled out today [04:13:01] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review, 07Upstream: Gerrit: Restarting gerrit could lead to data loss + maybe accounts - https://phabricator.wikimedia.org/T154205#2922457 (10demon) [04:18:18] (03CR) 10Dzahn: "needed follow-up to include acme-challenge snippet and exclude the challenge URL from http->https redirect https://gerrit.wikimedia.org/r/" [puppet] - 10https://gerrit.wikimedia.org/r/330633 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [04:18:50] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2922470 (10Dzahn) [04:19:08] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10Dzahn) Icinga switched to LE just now. 
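[Editor's note] The "exclude acme from proto redirect" follow-up above is the usual pattern for HTTP-01 validation: Letsencrypt must be able to fetch /.well-known/acme-challenge/ over plain HTTP, so that path has to be carved out of the site-wide HTTPS redirect. A sketch of that shape in Apache terms (assumed, not the actual puppet-managed vhost):

```apache
# Serve ACME HTTP-01 challenges over plain HTTP; redirect everything else.
RewriteEngine on
RewriteCond %{REQUEST_URI} !^/\.well-known/acme-challenge/
RewriteRule ^/(.*)$ https://%{SERVER_NAME}/$1 [R=301,L]
```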
[04:21:12] (03CR) 10Krinkle: build: require-dev phpunit in composer.json (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [04:24:21] (03PS2) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [04:24:59] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [04:27:09] !log Started FlowFixInconsistentBoards.php (production mode) on all wikis [04:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:06] !log Finished FlowFixInconsistentBoards.php (production mode) on all wikis [04:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:12] (03PS3) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [04:38:48] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:56:28] (03CR) 10Krinkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:56:33] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:01] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:09] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:57:20] (03PS1) 10KartikMistry: Fix 
parameter alignment [puppet] - 10https://gerrit.wikimedia.org/r/330844 [04:58:36] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [04:59:07] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:00:47] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:01:41] (03CR) 10jerkins-bot: [V: 04-1] build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [05:18:32] (03PS4) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [05:19:25] (03PS5) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 [05:20:08] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [05:24:20] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (owner: 10Krinkle) [05:32:31] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [05:59:32] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:28:01] PROBLEM - Check systemd state on graphite1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:28:11] PROBLEM - carbon-cache@b service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@b is failed [06:28:31] PROBLEM - carbon-cache@f service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is failed [06:35:11] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:49:11] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:49:33] (03PS2) 10Muehlenhoff: Puppetise yubikey-val (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/285962 [06:50:34] (03PS1) 10Urbanecm: Enable Extension:Babel's category on cswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330847 (https://phabricator.wikimedia.org/T67211) [06:54:31] RECOVERY - carbon-cache@f service on graphite1003 is OK: OK - carbon-cache@f is active [06:55:01] RECOVERY - Check systemd state on graphite1003 is OK: OK - running: The system is fully operational [06:55:11] RECOVERY - carbon-cache@b service on graphite1003 is OK: OK - carbon-cache@b is active [06:55:31] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:11] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [06:57:21] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [07:02:11] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:04:21] RECOVERY - Check systemd state on restbase-test1001 
is OK: OK - running: The system is fully operational [07:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:08:31] !log installing crypto++ security updates on trusty hosts [07:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:22] (03CR) 10Dereckson: [C: 031] Unassign 'transcode-reset' from Commons autoconfirmed and sysop groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330835 (https://phabricator.wikimedia.org/T154733) (owner: 10Zhuyifei1999) [07:29:18] (03CR) 10Dereckson: [C: 031] Enable Extension:Babel's category on cswikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330847 (https://phabricator.wikimedia.org/T67211) (owner: 10Urbanecm) [07:35:11] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:55:11] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:03:21] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:04:11] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [08:13:21] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:16:31] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK [08:23:11] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:30:12] PROBLEM - Host auth2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host acamar is DOWN: PING CRITICAL - Packet loss = 100% [08:30:12] PROBLEM - Host 2620:0:860:1:208:80:153:12 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:21] RECOVERY - Host auth2001 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [08:30:31] RECOVERY - Host acamar is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [08:30:31] PROBLEM - Host cp2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:31] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:31] PROBLEM - Host cp2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host mc2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host mc2006 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host ms-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:30:51] PROBLEM - Host ms-be2017 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:01] PROBLEM - Host mc2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:01] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:21] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms [08:31:21] RECOVERY - Host 2620:0:860:1:208:80:153:12 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [08:31:21] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [08:31:21] PROBLEM - Check systemd state on kafka2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:31:21] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:31:54] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - eventbus_8085 - Could not depool server kafka2002.codfw.wmnet because of too many down!: trendingedits_6699 - Could not depool server scb2004.codfw.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus2001.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many do [08:31:54] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 117, down: 1, dormant: 0, excluded: 0, unused: 0BRae1: down - Core: asw-a-codfw:ae2BR [08:32:41] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/50-server-status.conf],File[/etc/ganglia/conf.d/hhvm_mem.pyconf],File[/etc/ssh/userkeys/pybal-check] [08:33:21] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [08:33:21] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Package[screen],Package[jq],Package[zsh-beta] [08:33:34] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 3744 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 3648, number_of_pending_tasks: 3009, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3018, task_max_waiting_in_queue_millis: 130284, cluster_name: production-search-codfw, relocating_shards: 0, acti [08:33:41] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[gdb],Package[lldpd] [08:35:31] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:31] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:32] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:32] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:41] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 
[08:35:41] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:41] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:42] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:35:42] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:43] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:43] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:35:44] PROBLEM - IPsec on cp4003 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:36:01] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:01] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:01] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:01] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:02] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:02] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:03] PROBLEM - IPsec on mc1005 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2005_v4 [08:36:03] 
PROBLEM - IPsec on mc1006 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2006_v4 [08:36:04] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:21] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp2004_v4, cp2004_v6 [08:36:21] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:36:31] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp2005_v4, cp2005_v6 [08:36:31] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2006_v4, cp2006_v6 [08:38:00] Hi, Vasantrao Naik Govt. Institute Of Arts and Social Sciences - Nagpur event is facing problem unable to create user accounts and their IP address seems to be 117.211.27.103 [08:38:01] yeah, the only one we don't provide a range is the one bugged [08:39:21] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:40:05] I just got an alert on elastic search codfw shards. dcausse, could you check? I'm unable to at this point... [08:40:06] Not a high emergency, this is codfw... 
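The ElasticSearch shard alert being discussed here ("elasticsearch inactive shards ... threshold =0.1% breach") fires when the fraction of inactive shards exceeds a threshold. A minimal sketch of that logic, assuming field names from the Elasticsearch `_cluster/health` API; the helper name, threshold default, and sample numbers are illustrative, not the actual check's source:

```python
# Sketch of an inactive-shards threshold check over _cluster/health output.
# Field names follow the Elasticsearch cluster health API; everything else
# (function name, 0.1% default, sample numbers) is illustrative.

def inactive_shard_state(health: dict, threshold: float = 0.001) -> str:
    """Return 'OK' or 'CRITICAL' based on the share of inactive shards."""
    active = health["active_shards"]
    inactive = (
        health["unassigned_shards"]
        + health["initializing_shards"]
        + health["relocating_shards"]
    )
    total = active + inactive
    if total == 0:
        return "OK"
    return "CRITICAL" if inactive / total > threshold else "OK"

# Illustrative numbers in the spirit of the 08:33 alert: thousands of
# unassigned shards dwarf the 0.1% threshold.
alert = {
    "active_shards": 3744,
    "unassigned_shards": 3648,
    "initializing_shards": 96,
    "relocating_shards": 0,
}
print(inactive_shard_state(alert))  # CRITICAL
```

With half the shards inactive, the check goes red immediately; the later 11:29 RECOVERY corresponds to the fraction dropping back under the threshold as shards reassign.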
[08:40:32] gehel: sure will have a look [08:40:42] dcausse: gehel: seems to be a general problem with codfw/network, not specific to elastic [08:40:58] (03PS1) 10Giuseppe Lavagetto: Temporarily depool codfw [dns] - 10https://gerrit.wikimedia.org/r/330848 [08:42:22] (03CR) 10Giuseppe Lavagetto: [C: 032] Temporarily depool codfw [dns] - 10https://gerrit.wikimedia.org/r/330848 (owner: 10Giuseppe Lavagetto) [08:42:39] (03PS1) 10Dereckson: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) [08:42:41] Hi. moritzm or gehel > could you deploy this? ^ [08:43:08] <_joe_> Dereckson: not now [08:43:21] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:43:37] (03PS1) 10Ema: Route around codfw, network issues there [puppet] - 10https://gerrit.wikimedia.org/r/330850 [08:43:59] _joe_: it's a current event (India timezone), participants are there, but they can't create account. [08:44:17] the change is an IP edit to a throttle rule to unblock the situation [08:44:54] <_joe_> Dereckson: we're in the middle of an outage [08:45:17] Misinterpreted the "08:40:06 < gehel> Not a high emergency, this is codfw..." [08:45:30] k [08:47:07] (03CR) 10Giuseppe Lavagetto: [C: 032] Route around codfw, network issues there [puppet] - 10https://gerrit.wikimedia.org/r/330850 (owner: 10Ema) [08:47:18] <_joe_> ema: should I merge it and run puppet in ulsfo? 
[08:47:21] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [08:47:21] RECOVERY - Check systemd state on kafka2001 is OK: OK - running: The system is fully operational [08:47:32] _joe_: yes please [08:48:21] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:59:41] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:00:21] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:00:41] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:01:21] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:04:21] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [09:05:50] (03CR) 10ArielGlenn: [C: 032] Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [09:06:07] thanks [09:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:07:33] it's not there yet [09:07:38] it's only merged [09:08:04] (03CR) 10jenkins-bot: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330849 (https://phabricator.wikimedia.org/T154312) (owner: 10Dereckson) [09:08:26] Yes, but thanks to take care of it. 
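The throttle change merged above adds a temporary exception so workshop participants behind a single IP can create accounts. The real rule lives in PHP in wmf-config/throttle.php; as a language-neutral sketch (the entry fields, limits, and time window here are hypothetical, only the IP comes from the log), the matching logic amounts to:

```python
from datetime import datetime
import ipaddress

# Hypothetical throttle-exception entry modeled on the VNGIASS request:
# inside the event window, the workshop IP gets a raised account-creation
# limit instead of the default. All values below are illustrative.
EXCEPTIONS = [
    {
        "range": "117.211.27.103/32",         # workshop IP from the log
        "start": datetime(2017, 1, 6, 0, 0),  # illustrative event window
        "end": datetime(2017, 1, 6, 23, 59),
        "value": 100,                         # raised creation limit
    }
]
DEFAULT_LIMIT = 6  # illustrative default per-IP limit

def account_creation_limit(ip: str, when: datetime) -> int:
    """Return the creation limit that applies to this IP at this time."""
    addr = ipaddress.ip_address(ip)
    for rule in EXCEPTIONS:
        in_range = addr in ipaddress.ip_network(rule["range"])
        in_window = rule["start"] <= when <= rule["end"]
        if in_range and in_window:
            return rule["value"]
    return DEFAULT_LIMIT
```

This is why the deploy was time-sensitive: until the rule is synced, the default limit keeps blocking the event's shared IP.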
[09:10:01] scap going now, in theory [09:12:32] !log ariel@tin Synchronized wmf-config/throttle.php: Adjust throttle rule for Maharashtra 'Edit Wikipedia' workshop (VNGIASS) (duration: 02m 46s) [09:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:35] sync-apaches: 98% (ok: 297; fail: 0; left: 6) [09:12:36] and waiting for a while [09:12:45] ah just completed [09:13:17] I reported to the task it's done, thanks. [09:13:26] fails on 'mw2119.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', 'mw1216.eqiad.wmnet' [09:13:30] yw [09:14:44] fails also on 'mw1211.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1161.eqiad.wmnet' looking at the mw1* hosts [09:15:36] <_joe_> apergos: I guess we'll have to remove scap proxies that are in row A in codfw [09:16:08] if there are any more scaps before papaul comes on line [09:16:12] <_joe_> apergos: do that and re-run scap [09:16:39] <_joe_> apergos: can you look at it? [09:17:25] jouncebot: next [09:17:25] In 76 hour(s) and 42 minute(s): European Mid-day SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170109T1400) [09:17:51] _joe_, yes I'll take care of it [09:18:19] <_joe_> eu midday scap during the summit? [09:18:25] <_joe_> with almost no opsens around? [09:18:26] <_joe_> mh [09:18:34] <_joe_> anyways, no releases today [09:18:40] nope [09:23:57] _joe_: it was cancelled in greg's email iirc [09:24:00] !log asw-a7-codfw is down, serial console unresponsive [09:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:25] (03PS1) 10ArielGlenn: remove mw hosts in row A codfw from scap proxy list for now [puppet] - 10https://gerrit.wikimedia.org/r/330852 [09:26:15] uhh [09:26:16] why? [09:26:21] apergos: why? 
[09:26:49] so that we can have scap working for things like the above (throttle, etc) [09:27:02] paravoid: [09:27:03] row A codfw is working [09:27:07] it's just A7 that's down, no mw* there [09:27:14] ah a7 right [09:27:15] hm [09:27:18] that's one of the 10G switches [09:27:41] * Jan 9th: no train, SWATs only (but no one from RelEng is guaranteed to [09:27:44] * be around) (DevSummit+All Hands) [09:27:55] (03Abandoned) 10ArielGlenn: remove mw hosts in row A codfw from scap proxy list for now [puppet] - 10https://gerrit.wikimedia.org/r/330852 (owner: 10ArielGlenn) [09:28:26] I would be surprised if there are a bunch of swatters who line up for that day, but you never know [09:30:38] so are there scap issues? [09:31:11] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK [09:32:10] there were fails on 8 hosts, all of which tried to connect back to one of mw2080 through mw2085 [09:35:02] why though? [09:35:16] (sorry writing the task at the same time) [09:35:50] (that's fine) [09:39:27] apergos: these are servers which are currently decommissioned [09:39:35] hey moritz [09:39:50] ah great, so unrelated [09:39:52] seems Rob didn't complete these fully: https://phabricator.wikimedia.org/T154621 [09:39:55] but why did the 8 servers try to ssh to them (or scp or whatever)? [09:40:15] they're still in conftool, but he already powered them down [09:41:02] here's the failure list again: 'mw2119.codfw.wmnet', 'mw2215.codfw.wmnet', 'mw1201.eqiad.wmnet', 'mw2187.codfw.wmnet', 'mw1216.eqiad.wmnet' 'mw1211.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1161.eqiad.wmnet' [09:41:37] ah, not sure about those. I was referring to mw2080 through mw2085 [09:42:03] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2922945 (10faidon) [09:43:17] yeah, I got that. 
but you would think live servers would not refer to these mostly decommed ones for anything [09:44:12] they're also still listed in puppet, so all weird side effects can happen, such decoms should really be done in one piece... [09:46:05] sigh [09:48:33] 06Operations, 10Pybal, 10Traffic: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759#2922958 (10faidon) [09:49:48] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2922945 (10faidon) [09:53:31] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:53:57] ok that's weird but whatever: in spite of the whines from those 8 hosts, the file has in fact been synced there, I checked md5sum against the copy on tin :-/ [09:54:02] so calling it done [09:54:10] can you try another scap? [09:54:35] sure [09:55:49] same file I guess, as it's harmless? [09:56:09] paravoid: [09:56:18] yeah sure [09:59:01] so here we are at 98% complete again, I guess it's trying ssh transport and timing out on those mw20* hosts [09:59:04] we'll see in a minute [10:00:27] !log ariel@tin Synchronized wmf-config/throttle.php: test, noop (duration: 02m 45s) [10:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:29] ah [10:01:41] gah this is what I get for not having my morning cocoa [10:02:04] these whines from the 8 hosts: they were whines from the currently configured proxies trying to get to the mw208* hosts which are [10:02:08] half-decommissioned [10:02:15] so mystery solved [10:02:41] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:02:47] ah! 
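The verification described above ("I checked md5sum against the copy on tin") can be sketched as follows: comparing digests confirms a file really landed on a target even when scap reported failures for some proxies. The helper names and demo paths are illustrative:

```python
import hashlib
import os
import tempfile

def md5_of(path: str) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def synced(deploy_copy: str, target_copy: str) -> bool:
    """True when the target's file matches the deployment master's copy."""
    return md5_of(deploy_copy) == md5_of(target_copy)

# Demo with two throwaway files standing in for tin's copy and a target's.
with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "tin.php"), os.path.join(d, "mw.php")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"<?php // throttle config\n")
    print(synced(a, b))  # True: identical content, matching digests
```

A digest match is a stronger signal than scap's per-proxy exit status, which is why the failed-host "whines" could be set aside once the checksums agreed.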
[10:02:51] great, thanks [10:02:55] sorry for the noise [10:03:03] * apergos goes to get milk for cocoa :-D [10:03:04] no it's alright, useful to know [10:05:58] moritzm: did you follow up on the task about these half-decom hosts already? [10:06:45] not yet, will add a note now [10:08:17] thanks [10:14:52] (03PS2) 10DCausse: elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [10:15:11] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T154765) (owner: 10Gehel) [10:16:28] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (10MoritzMuehlenhoff) This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw2080-mw2085, which have been powered down... [10:21:31] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:28:47] 06Operations, 10Wikimedia-SVG-rendering: SVG fails to render properly due to several issues - https://phabricator.wikimedia.org/T46016#2923124 (10Aklapper) >>! In T46016#488898, @Aklapper wrote: >> 1—Failure to render masks >> Any masked object will disappear. Clipping works. > > Maybe related to http://bu... 
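The `Add NRPE check to monitor timesyncd` change (r/330854) tracked next in the log needs to map the daemon's state to a Nagios exit code. A minimal sketch of the decision, assuming `timedatectl show`-style KEY=value output: `NTPSynchronized` is a real systemd property, but the function, exit-code mapping, and sample output here are illustrative, not the actual plugin:

```python
# Nagios/NRPE plugin convention: exit code 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2

def timesync_status(output: str) -> int:
    """Map `timedatectl show`-style KEY=value output to a Nagios exit code.

    Looks for the NTPSynchronized property; anything but 'yes' (including
    a missing property) is treated as CRITICAL.
    """
    props = dict(
        line.split("=", 1)
        for line in output.strip().splitlines()
        if "=" in line
    )
    return OK if props.get("NTPSynchronized") == "yes" else CRITICAL

# In a real plugin the output would come from running timedatectl via
# subprocess; hardcoded here for illustration.
sample = "NTP=yes\nNTPSynchronized=yes\nTimezone=UTC"
print(timesync_status(sample))  # 0
```

Failing closed on a missing or unparseable property is the safer default for a monitoring check: an unknown sync state should page, not pass.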
[10:29:31] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK [10:31:12] (03PS1) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [10:31:41] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:33:05] 06Operations, 10Wikimedia-SVG-rendering: SVG fails to render properly due to several issues - https://phabricator.wikimedia.org/T46016#2923126 (10MoritzMuehlenhoff) We'll unfortunately not be able to easily upgrade to 2.41.0 at this point; librsvg started to implement parts of the code in Rust and Debian jessi... [10:41:03] (03PS1) 10Urbanecm: [throttle] Lift for 2017-01-10+cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330855 (https://phabricator.wikimedia.org/T154312) [10:43:54] (03CR) 10Volans: [C: 04-1] "I'm not convinced is the right approach, see my comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [10:45:21] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK [10:46:11] 06Operations, 10ops-codfw, 10netops: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923142 (10ema) The impact on varnish errors has been minimal. In codfw we've had two hiccups, one at 8:30 and another smaller one at 8:40 {F5241163} In ulsfo we had a small 503 spike at 8:51 {F5241168} We... [11:00:11] (03PS5) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:01:03] (03CR) 10jerkins-bot: [V: 04-1] Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [11:01:21] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:02:18] (03PS6) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:04:21] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [11:07:21] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:11:23] (03PS7) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:16:20] (03CR) 10Ema: Add NRPE check to monitor timesyncd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:18:08] (03Draft1) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [11:18:12] (03Draft2) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [11:22:55] (03PS2) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [11:29:35] RECOVERY - ElasticSearch health check for shards on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 24, unassigned_shards: 802, number_of_pending_tasks: 904, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3027, task_max_waiting_in_queue_millis: 192913, cluster_name: production-search-codfw, relocating_shards: 0, active_shards_percent_as_nu [11:29:43] (03PS8) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [11:30:00] (03PS2) 10Urbanecm: [throttle] Lift for 2017-01-10/12 
+ minor cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330855 (https://phabricator.wikimedia.org/T154312) [11:31:22] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [11:31:49] (03PS1) 10Muehlenhoff: Drop reference to the manpage (not available on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/330860 [11:33:17] (03CR) 10Muehlenhoff: [C: 032] Drop reference to the manpage (not available on jessie) [puppet] - 10https://gerrit.wikimedia.org/r/330860 (owner: 10Muehlenhoff) [11:34:46] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:46:34] (03PS3) 10Muehlenhoff: Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) [11:46:41] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:50:45] (03CR) 10Muehlenhoff: [C: 032] Add NRPE check to monitor timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330854 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [11:52:06] (03PS1) 10Ema: systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) [12:04:44] (03CR) 10Muehlenhoff: [C: 031] "Good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) (owner: 10Ema) [12:08:25] (03PS2) 10Ema: systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) [12:08:31] (03CR) 10Ema: [V: 032 C: 032] systemd-timesyncd: fix config file [puppet] - 10https://gerrit.wikimedia.org/r/330863 (https://phabricator.wikimedia.org/T150257) (owner: 10Ema) [12:10:44] 06Operations, 06Performance-Team, 10Thumbor: Implement rate limiter in Thumbor - https://phabricator.wikimedia.org/T151067#2923213 (10Gilles) [12:10:46] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#2923214 (10Gilles) [12:12:34] (03PS1) 10Muehlenhoff: Switch cache servers in ulsfo to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330865 (https://phabricator.wikimedia.org/T150257) [12:14:44] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [12:20:52] (03PS1) 10Gilles: Upgrade to 0.1.32 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/330866 [12:22:20] (03PS1) 10Gilles: Add new mandatory config value for SVG engine in Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/330867 (https://phabricator.wikimedia.org/T150754) [12:22:30] (03PS3) 10Muehlenhoff: Switch swift in esams to systemd-timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330404 (https://phabricator.wikimedia.org/T150257) [12:24:25] (03CR) 10Ema: [C: 031] Switch cache servers in ulsfo to timesyncd [puppet] - 10https://gerrit.wikimedia.org/r/330865 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [12:29:42] (03PS1) 10Gilles: Switch Thumbor to swift loader [puppet] - 10https://gerrit.wikimedia.org/r/330869 (https://phabricator.wikimedia.org/T151441) [12:31:38] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 
failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [12:41:43] (03CR) 10Muehlenhoff: "The admin group for datacenter ops is currently bound to the "salt::master::production" role via hieradata/role/common/salt/masters/produc" [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [12:58:36] (03PS3) 10Paladox: Redirect /changes/ to /r/changes/ [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) [12:59:38] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:00:31] (03CR) 10Hashar: [C: 04-1] "Gerrit is exposed under /r/ the issue reported in the task is due to git-review." [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) (owner: 10Paladox) [13:33:34] (03PS4) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [13:33:37] (03CR) 10Hashar: [C: 04-1] build: update rubocop to 0.39 and tweak config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:33:52] (03CR) 10jerkins-bot: [V: 04-1] build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:34:11] (03PS1) 10Muehlenhoff: Remove access credentials for asherman [puppet] - 10https://gerrit.wikimedia.org/r/330885 (https://phabricator.wikimedia.org/T152957) [13:37:55] (03PS5) 10Hashar: build: update rubocop to 0.39 and tweak config [puppet] - 10https://gerrit.wikimedia.org/r/330470 [13:38:15] (03CR) 10Hashar: "rebased / cleared out unrelated upgrades in Gemfile.lock" [puppet] - 10https://gerrit.wikimedia.org/r/330470 (owner: 10Hashar) [13:39:48] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. 
Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [13:39:52] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for asherman [puppet] - 10https://gerrit.wikimedia.org/r/330885 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [13:51:48] !log reedy@tin Started scap: Rebuild message cache for Echo api messages being missing T154110 [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:52] T154110: Echo Api Messages missing - https://phabricator.wikimedia.org/T154110 [13:53:00] (03PS1) 10Muehlenhoff: Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) [13:55:28] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:47] (03CR) 10Mark Bergsma: [C: 031] Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:07:48] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:09:03] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for laner [puppet] - 10https://gerrit.wikimedia.org/r/330891 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:10:18] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:28] PROBLEM - Nginx local proxy to apache on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:08] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:03] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923393 (10Reedy) [14:16:48] !log reedy@tin Finished scap: Rebuild message 
cache for Echo api messages being missing T154110 (duration: 25m 00s) [14:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:53] T154110: Echo Api Messages missing - https://phabricator.wikimedia.org/T154110 [14:22:13] (03PS1) 10Muehlenhoff: Remove access credentials for declerambaul [puppet] - 10https://gerrit.wikimedia.org/r/330903 (https://phabricator.wikimedia.org/T152957) [14:24:28] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:24:35] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for declerambaul [puppet] - 10https://gerrit.wikimedia.org/r/330903 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:34:18] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [14:34:47] (03PS1) 10Muehlenhoff: Remove access credentials for srijan [puppet] - 10https://gerrit.wikimedia.org/r/330904 (https://phabricator.wikimedia.org/T152957) [14:36:32] (03PS2) 10Tim Landscheidt: icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 [14:37:18] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:39:59] (03CR) 10Volans: "For the main issue see my reply inline. I'm leaving the other 2 minor comments as is." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:42:05] (03PS9) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [14:44:45] (03CR) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:48:39] (03CR) 10Chad: [C: 04-1] "I don't like this, it's basically doing a workaround for broken git-review behavior. We should get upstream to fix git-review instead (or " [puppet] - 10https://gerrit.wikimedia.org/r/330858 (https://phabricator.wikimedia.org/T154760) (owner: 10Paladox) [14:49:42] (03PS2) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) [14:52:04] (03CR) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:52:18] PROBLEM - configured eth on lvs2006 is CRITICAL: eth1 reporting no carrier. [14:52:28] PROBLEM - configured eth on lvs2005 is CRITICAL: eth1 reporting no carrier. [14:52:38] PROBLEM - configured eth on lvs2004 is CRITICAL: eth1 reporting no carrier. [14:52:50] mark: related to the reboot? 
^^^ [14:53:02] likely yes [14:53:28] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for srijan [puppet] - 10https://gerrit.wikimedia.org/r/330904 (https://phabricator.wikimedia.org/T152957) (owner: 10Muehlenhoff) [14:53:28] !log papaul powercycled asw-a7-codfw 14:50 [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] RECOVERY - Host cp2006 is UP: PING WARNING - Packet loss = 54%, RTA = 36.06 ms [14:55:18] RECOVERY - IPsec on mc1004 is OK: Strongswan OK - 1 ESP OK [14:55:18] RECOVERY - Host mc2004 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [14:55:18] RECOVERY - Host cp2004 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [14:55:18] RECOVERY - Host mc2006 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [14:55:18] RECOVERY - Host ms-fe2002 is UP: PING OK - Packet loss = 0%, RTA = 37.33 ms [14:55:19] RECOVERY - Host ms-be2017 is UP: PING OK - Packet loss = 16%, RTA = 36.17 ms [14:55:19] RECOVERY - Host cp2005 is UP: PING OK - Packet loss = 16%, RTA = 36.20 ms [14:55:20] RECOVERY - configured eth on lvs2006 is OK: OK - interfaces up [14:55:28] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [14:55:28] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [14:55:28] RECOVERY - configured eth on lvs2005 is OK: OK - interfaces up [14:55:38] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [14:55:38] RECOVERY - configured eth on lvs2004 is OK: OK - interfaces up [14:55:38] RECOVERY - Host mc2005 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms [14:55:38] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [14:55:38] RECOVERY - IPsec on cp4001 is OK: Strongswan OK - 28 ESP OK [14:55:38] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [14:55:39] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [14:55:39] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [14:55:40] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [14:55:40] RECOVERY 
- IPsec on cp3007 is OK: Strongswan OK - 28 ESP OK [14:55:41] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [14:55:41] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [14:55:42] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 28 ESP OK [14:55:42] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [14:56:08] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [14:56:08] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [14:56:08] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [14:56:09] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [14:56:09] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [14:56:10] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [14:56:10] RECOVERY - IPsec on mc1005 is OK: Strongswan OK - 1 ESP OK [14:56:11] RECOVERY - IPsec on mc1006 is OK: Strongswan OK - 1 ESP OK [14:56:11] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [14:56:12] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [14:56:12] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [14:57:23] (03PS10) 10Volans: Cumin: allow connection to the targets [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) [14:57:28] PROBLEM - Freshness of OCSP Stapling files on cp2006 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2016-rsa-unified.ocsp is more than 18300 secs old! [14:57:38] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:38] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - Freshness of OCSP Stapling files on cp2005 is CRITICAL: CRITICAL: File /var/cache/ocsp/globalsign-2016-rsa-unified.ocsp is more than 18300 secs old! [14:57:48] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on ms-fe2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:48] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:49] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:57:49] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. 
Failed resources (up to 3 shown): Package[htop],Package[tcpdump],Package[gdb],Package[lldpd] [14:58:48] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [14:59:48] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [14:59:55] (03PS1) 10Muehlenhoff: Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/330914 [15:00:58] (03CR) 10Muehlenhoff: [C: 032] Add retroactively assigned CVE ID [debs/linux44] - 10https://gerrit.wikimedia.org/r/330914 (owner: 10Muehlenhoff) [15:01:38] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:01:48] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [15:03:28] RECOVERY - Freshness of OCSP Stapling files on cp2006 is OK: OK [15:03:38] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:03:48] RECOVERY - puppet last run on ms-fe2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:07:14] 06Operations, 10ops-eqiad, 13Patch-For-Review: rack/setup prometheus100[3-4] - https://phabricator.wikimedia.org/T152504#2923502 (10Cmjohnson) [15:07:36] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923503 (10Papaul) mw2079 ge-4/0/38 mw2080 ge-3/0/0 mw2081 ge-3/0/1 mw2082 ge-3/0/2 mw2083 ge-3/0/3 mw2084 ge-3/0/4 mw2085 ge-3/0/5 mw2086 ge-3/0/6 mw2087 ge-3/0/7 mw20... 
[15:07:38] RECOVERY - puppet last run on lvs2005 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:14:01] 06Operations, 10ops-eqiad: update label/racktables visible label for thumbor100[12] - https://phabricator.wikimedia.org/T153965#2923511 (10Cmjohnson) 05Open>03Resolved [15:14:28] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:15:40] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:16:14] 06Operations, 10ops-eqiad: update label/racktables visible label for labservices1002/WMF4075 - https://phabricator.wikimedia.org/T153967#2923515 (10Cmjohnson) 05Open>03Resolved [15:19:48] RECOVERY - Freshness of OCSP Stapling files on cp2005 is OK: OK [15:21:48] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:22:28] !log elastic2025-elastic2036 - signing puppet certs, salt-key, initial run [15:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:25:48] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:26:29] 06Operations, 06Operations-Software-Development: Puppet compiler: order resources for easy comparison between hosts - https://phabricator.wikimedia.org/T154776#2923517 (10Volans) [15:28:08] 06Operations, 10ops-eqiad: Rack and setup wdqs1003 - https://phabricator.wikimedia.org/T153349#2923531 (10Cmjohnson) [15:28:12] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: rack/setup/install wdqs1003 - https://phabricator.wikimedia.org/T152643#2923533 (10Cmjohnson) [15:29:12] (03CR) 10Volans: "I've fixed a bunch of issues and now the puppet compiler 
results seems ok to me." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/330436 (https://phabricator.wikimedia.org/T154588) (owner: 10Volans) [15:29:25] !log powering off mw1239 to reseat DIMM [15:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:54] (03PS3) 10Ema: varnishstatsd: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/330668 (https://phabricator.wikimedia.org/T151643) [15:35:31] thanks cmjohnson1! [15:36:29] 06Operations, 10ops-eqiad: mw1239: memory scrubbing error - https://phabricator.wikimedia.org/T148421#2923539 (10Cmjohnson) @elukey DIMM A1 swapped with B1. Let's see what happens [15:40:45] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923547 (10mark) The switch has been brought back up with a hard power cycle. We don't have a real indication yet of why it crashed, there's nothing concrete in the logs (local or otherwise),... [15:41:00] 06Operations, 10ops-codfw, 10netops, 05Security: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758#2923548 (10mark) p:05Unbreak!>03Normal [15:41:49] (03PS1) 10Cmjohnson: Removing final dns entries for mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/330918 [15:42:28] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:46:34] 06Operations: setup/install mwlog1001/WMF4724 - https://phabricator.wikimedia.org/T153361#2923553 (10Cmjohnson) [15:49:53] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#2923554 (10Cmjohnson) @robh @fgiunchedi Both of the servers are running on one PSU. Both are very much out of warranty. Is there any action to take here?
[15:50:08] (03CR) 10Cmjohnson: [C: 032] Removing final dns entries for mw1017 and mw1099 T151303 [dns] - 10https://gerrit.wikimedia.org/r/330918 (owner: 10Cmjohnson) [15:50:53] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151295#2923559 (10Cmjohnson) [15:50:55] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, 15User-Joe: Hardware decommission mw1017, mw1099 - https://phabricator.wikimedia.org/T151303#2923557 (10Cmjohnson) 05Open>03Resolved Removed the remaining DNS entries. [15:51:00] 06Operations, 10ops-eqiad, 10Cassandra, 06Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#2923560 (10RobH) Are there any pending decom hosts unracked that have the same power supplies so we can steal them? [15:58:49] (03PS1) 10Papaul: Add elastic2025-elastic2036 Bug: T154251 [puppet] - 10https://gerrit.wikimedia.org/r/330923 (https://phabricator.wikimedia.org/T154251) [15:59:15] 06Operations, 10ops-eqiad, 10Cassandra, 13Patch-For-Review, 06Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#2923568 (10Cmjohnson) [15:59:17] 06Operations, 10ops-eqiad: Rename/relabel restbase-test1* to restbase-dev1* - https://phabricator.wikimedia.org/T154629#2923566 (10Cmjohnson) 05Open>03Resolved Switch ports, racktables updated...resolving [15:59:43] volans: can you please review and merge this [15:59:50] volans: https://gerrit.wikimedia.org/r/#/c/330923/ [15:59:53] volans: thanks [15:59:58] * volans looking [16:01:58] PROBLEM - Host 208.80.155.118 is DOWN: CRITICAL - Host Unreachable (208.80.155.118) [16:02:48] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.155.117) [16:03:46] ^ andrewbogott? [16:04:05] chasemp: ah yeah, just missed an alert to silence...
[16:04:18] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [16:04:57] (03CR) 10Volans: [C: 032] Add elastic2025-elastic2036 Bug: T154251 [puppet] - 10https://gerrit.wikimedia.org/r/330923 (https://phabricator.wikimedia.org/T154251) (owner: 10Papaul) [16:05:19] !log wiping codfw caches T154758 [16:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:52] T154758: asw-a7-codfw is down - https://phabricator.wikimedia.org/T154758 [16:06:59] (03PS3) 10Hashar: beta::autoupdater: Stop wmf-beta-mwconfig-update being a template just to get the staging dir [puppet] - 10https://gerrit.wikimedia.org/r/322408 (owner: 10Alex Monk) [16:07:09] (03PS1) 10Muehlenhoff: Update to 4.4.40 [debs/linux44] - 10https://gerrit.wikimedia.org/r/330926 [16:07:10] papaul: merged and puppet-merged [16:07:18] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:07:30] (03CR) 10Hashar: [C: 031] "I have cherry picked it on the beta cluster puppet master. Ran the Jenkins build and it seems all happy https://integration.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/322408 (owner: 10Alex Monk) [16:07:42] volans: thanks, will try again after I get some food [16:07:52] sure! :) [16:07:57] hope it works [16:08:59] volans: I just did one to be sure, yes it is working [16:09:03] volans: thanks [16:09:34] yw [16:09:38] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp2026.codfw.wmnet because of too many down! [16:09:48] PROBLEM - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - uploadlb_443 - Could not depool server cp2008.codfw.wmnet because of too many down!: uploadlb6_443 - Could not depool server cp2026.codfw.wmnet because of too many down!
[16:10:00] ema: ^^^ [16:10:50] volans: that's probably because of the cache wipes, codfw is depooled anyways [16:11:04] I know it's depooled, just FYI [16:11:11] yeah thanks :) [16:11:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:11:28] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:11:43] uh [16:12:18] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:18] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:19] PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:29] wat? [16:12:38] PROBLEM - Swift HTTP frontend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:38] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - swift_80 - Could not depool server ms-fe1003.eqiad.wmnet because of too many down! [16:12:44] PROBLEM - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:48] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - swift_80 - Could not depool server ms-fe1003.eqiad.wmnet because of too many down! [16:12:49] eqiad? 
[16:12:50] that doesn't look good [16:13:08] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:13:09] RECOVERY - Host 208.80.155.118 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:13:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [16:13:09] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:09] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:15] (03CR) 10Hashar: [C: 031] toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [16:13:19] though I can still view images [16:13:27] <_joe_> cached ones I guess [16:14:03] hmm [16:14:04] looking now too [16:14:14] yeah, there are issues [16:14:18] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [16:14:23] I tried some different sizes but I guess all the ones I happened to try were cached [16:14:40] a 322px version gives HTTP 503 [16:14:48] PROBLEM - Check HHVM threads for leakage on mw1295 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers [16:14:55] !log Restarting Nodepool [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:01] <_joe_> something bad is happening and I'm not sure what [16:15:34] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure, 07Wikimedia-Incident: Replace fans (or paste) on labservices1001 - https://phabricator.wikimedia.org/T154391#2923595 (10Cmjohnson) @Andrew Replaced the thermal paste on labservices1001....it didn't look dry and crusty so not 100% it will fix the i... 
[16:15:40] Images are down https://commons.wikimedia.org/wiki/File:The_Adoration_of_the_Magi_(Matthias_Stom)_-_Nationalmuseum_-_18796.jpg [16:15:45] I merged a change yesterday to rewrite.py, it might be a side effect of that [16:15:47] <_joe_> paladox: we know [16:15:48] RECOVERY - Check HHVM threads for leakage on mw1295 is OK: OK [16:15:57] <_joe_> godog: no request is getting to the scalers [16:15:58] one I just tried said "Error creating thumbnail: File missing"? [16:15:59] oh ok [16:16:08] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.020 second response time [16:16:14] I get If you report this error to the Wikimedia System Administrators, please include the details below. [16:16:14] Request from 104.131.110.123 via cp1072 cp1072, Varnish XID 104333370 [16:16:14] Error: 503, Backend fetch failed at Fri, 06 Jan 2017 16:16:06 GMT [16:17:23] !log bounce swift-proxy on ms-fe100[123] leave ms-fe1004 for investigation [16:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:18] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:25] Although this PHP script (/w/index.php) exists, the file requested for output (mwstore://global-swift-eqiad/captcha-render/5/a/f/image_aa6f9790_5af00d48c7134951.png) does not. [16:21:37] guess because of the swift issues? [16:21:41] <_joe_> yes [16:22:07] ok, I'll wait until it's fixed.
[16:22:28] PROBLEM - Check size of conntrack table on ms-fe1002 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [16:24:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:25:08] PROBLEM - Check size of conntrack table on ms-fe1003 is CRITICAL: CRITICAL: nf_conntrack is 90 % full [16:25:28] RECOVERY - Check size of conntrack table on ms-fe1002 is OK: OK: nf_conntrack is 44 % full [16:25:36] I'm temporarily bumping conntrack table sizes on ms-fe1* [16:27:03] RECOVERY - Check size of conntrack table on ms-fe1003 is OK: OK: nf_conntrack is 44 % full [16:28:33] PROBLEM - Elasticsearch HTTPS on elastic2025 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:28:54] ^ [16:29:03] is a new elastic node [16:29:13] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [16:29:53] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 57 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [16:30:57] Hi [16:31:07] image issues ShakespeareFan00? [16:31:09] You probably already know this [16:31:15] yes [16:31:21] Someone in -commons also just complained [16:31:23] but I'm getting a backend error when accessing images... [16:31:24] fwiw [16:31:33] PROBLEM - Elasticsearch HTTPS on elastic2026 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2026.codfw.wmnet [16:31:49] dcausse, these elasticsearch https alerts in codfw nothing to worry about?
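An aside on the "nf_conntrack is 90 % full" alerts and the temporary bump logged at 16:25: the check compares the current connection-tracking entry count against the table's maximum. A minimal sketch of the arithmetic and the usual short-term remediation — the `conntrack_pct_full` helper is hypothetical, while `/proc/sys/net/netfilter/nf_conntrack_{count,max}` and the `net.netfilter.nf_conntrack_max` sysctl are the standard Linux interfaces:

```shell
# Percentage of the conntrack table in use, as a check like Icinga's would
# compute it (integer division of count*100 by max).
conntrack_pct_full() {
    count=$1; max=$2
    echo $(( count * 100 / max ))
}

conntrack_pct_full 235929 262144   # sample values: prints 89

# On the affected host, the live numbers come from procfs:
#   cat /proc/sys/net/netfilter/nf_conntrack_count
#   cat /proc/sys/net/netfilter/nf_conntrack_max
# Temporary bump (lost on reboot; persist via sysctl.d/puppet if it should stick):
#   sudo sysctl -w net.netfilter.nf_conntrack_max=524288
```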
[16:31:53] Not a major issue, but a pain when you need to do proof-reading on Wikisource which needs the scan images [16:32:07] Krenair: nope these are new nodes being racked [16:32:07] (03PS1) 10RobH: decom mw2080-2085 [puppet] - 10https://gerrit.wikimedia.org/r/330930 [16:32:55] (03CR) 10RobH: [C: 032] decom mw2080-2085 [puppet] - 10https://gerrit.wikimedia.org/r/330930 (owner: 10RobH) [16:34:55] hashar: https://doc.wikimedia.org/puppet is nice! [16:35:43] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923626 (10RobH) >>! In T154621#2923111, @MoritzMuehlenhoff wrote: > This morning a deployment by Ariel of a mw-config throttling change failed since scap tried to connect to mw20... [16:36:13] hashar: do you know why the class docstring shows here https://doc.wikimedia.org/puppet/puppet_classes/restbase.html, but not here https://doc.wikimedia.org/puppet/puppet_classes/cassandra.html? is it that extra new line before the class definition?
[16:36:28] bblack: friendly reminder that as you mentioned we will plan for this work to be done in q3: https://phabricator.wikimedia.org/T138027 [16:37:13] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] [16:37:46] (03PS1) 10Giuseppe Lavagetto: Depooling ulsfo upload caches [dns] - 10https://gerrit.wikimedia.org/r/330931 [16:37:55] <_joe_> mark, ema ^^ [16:38:13] <_joe_> but esams and eqiad have the same issue [16:38:20] <_joe_> ganglia lost them [16:38:23] <_joe_> so maybe it's not that [16:38:28] not sure this is helpful really [16:42:30] 06Operations, 10Gerrit, 06Release-Engineering-Team: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#2923650 (10Papaul) [16:42:32] 06Operations, 10ops-codfw: update the label and racktables entry for gerrit2001/WMF6408 & install SSDs - https://phabricator.wikimedia.org/T152527#2923648 (10Papaul) 05Open>03Resolved Disks installation complete. [16:43:33] (03CR) 10Hashar: [C: 04-1] elasticsearch tool [software/elasticsearch-tool] - 10https://gerrit.wikimedia.org/r/309573 (owner: 10Gehel) [16:44:57] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:45:15] w [16:46:34] 06Operations, 10ops-codfw: update the label and racktables entry for gerrit2001/WMF6408 & install SSDs - https://phabricator.wikimedia.org/T152527#2923676 (10demon) Awesome thanks! [16:46:48] (03CR) 10Jforrester: [C: 031] "We should get this done sooner rather than later to avoid T154110-like issues."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [16:46:48] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 863 bytes in 0.348 second response time [16:47:24] 06Operations, 10ops-codfw: decommission old mw appservers to make room for new systems - https://phabricator.wikimedia.org/T154621#2923679 (10RobH) [16:50:26] (03PS1) 10RobH: decom mw2075-2089 [dns] - 10https://gerrit.wikimedia.org/r/330937 [16:50:41] Hey folks, I'm struggling to upload a patchset against a non-default branch for a repo in gerrit. I see that the puppet repo has more than one branch maintained so I was hoping that someone from operations could give me some pointers. [16:50:46] Ops puppet branches: https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet,branches [16:51:01] ORES wheels branches: https://gerrit.wikimedia.org/r/#/admin/projects/research/ores/wheels,branches [16:51:03] (03CR) 10RobH: [C: 032] decom mw2075-2089 [dns] - 10https://gerrit.wikimedia.org/r/330937 (owner: 10RobH) [16:51:43] When I try to "git review -R wmflabs", I get http://pastebin.ca/3753822 [16:52:33] My local branch (called update_libraries) is based on the wmflabs branch and has one additional commit. [16:53:07] halfak: outage reasoning and response has been in progress for a while, I imagine, fyi [16:53:48] chasemp, there's an outage in progress? [16:54:38] halfak: image serving is/was down and is/was under extreme duress yes (see topic) [16:54:44] gotcha. [16:54:57] Thanks for the heads up.
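On halfak's non-default-branch question above: when `git review -R <branch>` misbehaves, Gerrit's native push syntax is the usual fallback — a commit is submitted for review against branch X by pushing to `refs/for/X`. A sketch, where the `refspec_for` helper is hypothetical (just to make the mapping explicit) and the remote name `origin` is assumed to point at Gerrit:

```shell
# Gerrit turns a push to refs/for/<branch> into a new change (or a new
# patchset, if the commit carries a known Change-Id) targeting <branch>.
refspec_for() {
    printf 'HEAD:refs/for/%s\n' "$1"
}

refspec_for wmflabs   # prints HEAD:refs/for/wmflabs

# So from the local update_libraries branch (based on wmflabs), roughly:
#   git push origin "$(refspec_for wmflabs)"
# i.e. the classic  git push origin HEAD:refs/for/wmflabs
```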
[16:59:26] PROBLEM - Elasticsearch HTTPS on elastic2028 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2028.codfw.wmnet [16:59:46] PROBLEM - Elasticsearch HTTPS on elastic2034 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2034.codfw.wmnet [16:59:46] PROBLEM - puppet last run on elastic2036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/lib/json-simple.jar],Package[elasticsearch/plugins] [16:59:56] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:16] PROBLEM - Elasticsearch HTTPS on elastic2027 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2027.codfw.wmnet [17:00:16] PROBLEM - puppet last run on elastic2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:00:16] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:26] PROBLEM - Elasticsearch HTTPS on elastic2032 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2032.codfw.wmnet [17:00:26] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/elasticsearch/lib/json-simple.jar],Package[elasticsearch/plugins] [17:00:56] PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:06] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:01:06] PROBLEM - Elasticsearch HTTPS on elastic2031 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2031.codfw.wmnet [17:01:06] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:09] gehel: dcausse ^ [17:01:26] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:36] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:47] chasemp: those are new hosts [17:01:49] chasemp: these are new nodes, is it possible to silence all icinga alerts for all elastic2025 and up? [17:01:56] PROBLEM - Check systemd state on elastic2035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:56] PROBLEM - Elasticsearch HTTPS on elastic2030 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2030.codfw.wmnet [17:01:56] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:01:56] PROBLEM - Check systemd state on elastic2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
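A note on the recurring "failed to verify search.svc.codfw.wmnet against elastic20xx.codfw.wmnet" alerts above: they mean the node's TLS certificate does not (yet) carry the service name in its subjectAltName, which is expected while the hosts are still being provisioned. A hedged sketch of checking that by hand — `san_contains` is a hypothetical helper, and the `openssl x509 -ext` option assumes OpenSSL 1.1.1 or newer:

```shell
# Return success iff a comma-separated subjectAltName list contains DNS:<name>.
san_contains() {
    case ", $1," in
        *"DNS:$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# A freshly provisioned node whose cert only names itself fails the check:
san_contains "DNS:elastic2036.codfw.wmnet" "search.svc.codfw.wmnet" \
    || echo "service name missing from SAN"

# Fetching the real SAN list from a live node would look something like:
#   echo | openssl s_client -connect elastic2036.codfw.wmnet:9243 2>/dev/null \
#       | openssl x509 -noout -ext subjectAltName
```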
[17:02:06] PROBLEM - Elasticsearch HTTPS on elastic2036 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2036.codfw.wmnet [17:02:06] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational [17:02:14] papaul: if you're not around I'll silence them [17:02:26] dcausse: I don't know of a way to silence hosts before they show up like that, you gotta try to catch it afaik :) [17:02:36] PROBLEM - Elasticsearch HTTPS on elastic2029 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2029.codfw.wmnet [17:02:36] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 24 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[elasticsearch/plugins] [17:02:40] but cool no worries, just pinging to make sure [17:02:46] sure [17:02:46] RECOVERY - puppet last run on elastic2036 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:02:46] PROBLEM - Elasticsearch HTTPS on elastic2035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:03:40] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2923787 (10Niharika) Hi @Shoichi is the translation work currently in progress? [17:03:48] thanks volans [17:05:06] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2923788 (10RobH) a:05RobH>03Papaul [17:05:06] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:05:49] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2918062 (10RobH) This task is now assigned to @papaul for the disk wipes. 
Once the disks are wiped and the systems are pulled from the racks, I'll remove their network port entries and... [17:05:57] chasemp, dcausse, papaul: I've set them in downtime until Monday around this time, increase it if needed [17:06:08] 06Operations, 10ops-codfw: decommission mw2075-2089 to make room for new systems - https://phabricator.wikimedia.org/T154621#2923807 (10RobH) [17:06:20] volans: you're a scholar and a gentleman [17:06:20] volans: thanks! [17:06:47] :) [17:06:58] you're welcome, sir ;) [17:08:20] volans: thanks [17:09:16] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2923813 (10Papaul) [17:09:36] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:09:36] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:09:53] 06Operations, 10ops-codfw, 06Discovery, 06Discovery-Search, 10Elasticsearch: rack/setup/install elastic2025-2036 - https://phabricator.wikimedia.org/T154251#2905389 (10Papaul) a:05Papaul>03Gehel @Gehel you can take over. [17:10:37] Any ETA on the image backend being up? [17:11:56] It's ready when it's fixed?
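On the downtime volans set above: with an Icinga 1 setup this is typically done by writing the documented `SCHEDULE_HOST_DOWNTIME` external command into the command file. A sketch — the `downtime_cmd` helper, the author/comment strings, and the command-file path are illustrative; the semicolon-separated field order is from the Icinga/Nagios external-command reference:

```shell
# [ts] SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
downtime_cmd() {    # usage: downtime_cmd host start_epoch end_epoch
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;0;volans;new hosts being provisioned\n' \
        "$2" "$1" "$2" "$3"
}

now=$(date +%s)
until_monday=$(( now + 3 * 24 * 3600 ))
for h in elastic2025 elastic2026 elastic2027; do   # ...and so on through elastic2036
    downtime_cmd "$h" "$now" "$until_monday"
done
# The output would be appended to the command file on the Icinga host, e.g.:
#   ... >> /var/lib/icinga/rw/icinga.cmd
```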
[17:12:16] RECOVERY - puppet last run on elastic2029 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:12:56] <_joe_> ShakespeareFan00: not really, sorry [17:13:16] andre__: Okay not a major concern [17:13:18] <_joe_> we're still trying to figure out how to mitigate this [17:13:27] Also saving text is really slow for me right now [17:13:40] And I had an https connection failure [17:14:50] <_joe_> ShakespeareFan00: related, sadly [17:15:06] <_joe_> but text is partly uncached but should work mostly fine [17:16:45] Also when I KNOW I've edited a page the changes aren't propagating - https://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&limit=500&hidelinks=1&target=Template%3Affdc&namespace= [17:16:55] containing links I KNOW I've removed. [17:17:02] (links or template inclusions) [17:17:28] job queue is not affected [17:17:38] but in any case, be patient [17:19:56] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:21:56] RECOVERY - puppet last run on elastic2028 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:22:27] <_joe_> yeah an unpredicted event wiped out most of our caches [17:23:12] (03CR) 10Chad: Gerrit: Enable logstash in gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:23:16] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:25:18] (03CR) 10Chad: Gerrit: Enable g1 gc as we now use java 8 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/327763 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:25:31] (03PS3) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:25:40] (03PS4) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] -
10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:26:00] (03CR) 10Chad: "useUnicode is ok. Do we want/need the connectionCollation change yet since we haven't adjusted the DB?" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:26:02] (03CR) 10Paladox: Gerrit: Enable logstash in gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:26:33] (slowly) [17:27:21] (03CR) 10Chad: "One minor nit, otherwise this is fine" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:27:31] (03CR) 10Chad: "We should go ahead with this." [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:27:57] RECOVERY - Check systemd state on elastic2035 is OK: OK - running: The system is fully operational [17:27:59] (03CR) 10Paladox: "> useUnicode is ok. Do we want/need the connectionCollation change" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [17:28:17] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:28:27] RECOVERY - puppet last run on elastic2035 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:28:50] (03CR) 10Paladox: "Is there a date? Should we book a date for this as i think gerrit needs to be down for this." 
[puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:28:57] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:29:57] PROBLEM - Check systemd state on elastic2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:29:58] (03CR) 10Paladox: "> Is there a date? Should we book a date for this as i think gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:30:31] (03CR) 10Chad: "I think this can be abandoned. Looking at the task, it looks like we'll be going in a different direction--looks like we'll allow a policy" [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [17:30:47] PROBLEM - Elasticsearch HTTPS on elastic2033 is CRITICAL: SSL CRITICAL - failed to verify search.svc.codfw.wmnet against elastic2033.codfw.wmnet [17:30:57] (03CR) 10Paladox: Add support for searching gerrit using bug:T1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:31:11] (03Abandoned) 10Paladox: phabricator: allow mirroring from git.legoktm.com into Diffusion [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) (owner: 10Paladox) [17:31:29] (03CR) 10Chad: "Would need a date for the brief downtime for applying the config, applying the change and bringing back up. Shouldn't take more than 5-10 " [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:31:44] dcausse, papaul: I guess we have the SSL checks too :( [17:31:58] (03CR) 10Chad: [C: 031] "Silly gerrit. 
Fair enough :)" [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:32:15] there's no way to silence all alerts from a particular host? :/ [17:32:45] (03CR) 10Paladox: "Yep lol." [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:33:08] dcausse: I did, probably 2033 came up just now [17:33:17] it doesn't have the downtime [17:33:23] oh ok [17:33:48] (03PS5) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:34:18] dcausse: added there too [17:34:23] thanks :) [17:40:27] (03CR) 10Paladox: "> Would need a date for the brief downtime for applying the config," [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:42:45] (03CR) 10Paladox: "@Chad can this be merged? I disabled the host part so logstash won't be enabled by default."
[puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [17:43:34] (03PS6) 10Paladox: Gerrit: Enable logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [17:45:58] (03CR) 10Paladox: "We will want to keep an eye out for a week to make sure this does not cause any side effects + make sure everyone's usernames are converted t" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [17:47:35] (03PS11) 10Hashar: Modification of Rakefile spec entry point [puppet] - 10https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: 10Nicko) [17:48:09] (03PS1) 10Giuseppe Lavagetto: swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 [17:48:50] (03Abandoned) 10Giuseppe Lavagetto: Depooling ulsfo upload caches [dns] - 10https://gerrit.wikimedia.org/r/330931 (owner: 10Giuseppe Lavagetto) [17:49:59] (03PS4) 10Hashar: Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 [17:50:17] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:51:53] (03CR) 10jerkins-bot: [V: 04-1] Use task to run modules spec [puppet] - 10https://gerrit.wikimedia.org/r/307223 (owner: 10Hashar) [17:54:08] (03PS21) 10Dzahn: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [17:56:08] (03CR) 10Tim Landscheidt: "> […]" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [17:59:06] hi [18:00:09] (03CR) 10BBlack: [C: 031] swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 (owner: 10Giuseppe Lavagetto) [18:00:55] (03Abandoned) 10Tim Landscheidt: WIP: Add BigBrotherMonitor [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/233338 (owner: 10Tim Landscheidt) [18:00:59] (03CR) 10Filippo Giunchedi: [C: 032] swift: move thumbs handling to codfw temporarily [puppet] - 10https://gerrit.wikimedia.org/r/330950 (owner: 10Giuseppe Lavagetto) [18:01:16] (03PS1) 10Dzahn: delete icinga SSL cert, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/330957 [18:01:26] (03Abandoned) 10Tim Landscheidt: Tools: Migrate from bigbrother to bigbrothermonitor [puppet] - 10https://gerrit.wikimedia.org/r/234051 (owner: 10Tim Landscheidt) [18:02:45] (03CR) 10Dzahn: "when doing this, the key should be deleted in private repo too" [puppet] - 10https://gerrit.wikimedia.org/r/330957 (owner: 10Dzahn) [18:03:27] (03PS3) 10Dzahn: icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 (owner: 10Tim Landscheidt) [18:03:47] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:04:27] RECOVERY - Check systemd state on restbase-test1001 is OK: OK - running: The system is fully operational [18:04:42] (03CR) 10Dzahn: [C: 032] icinga: Indent @ssl_settings in Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/329739 (owner: 10Tim Landscheidt) [18:07:27] PROBLEM - Check systemd state on restbase-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:07:27] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) [18:08:01] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) @Aklapper The problems have been minimized, but this is still WIP. [18:08:23] (03PS1) 10Andrew Bogott: Puppetmaster: Remove remnants of ldap node definitions. 
[puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) [18:08:31] !log force puppet run on eqiad cache_upload to switch thumbs to codfw [18:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:52] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923953 (10jcrespo) [18:10:42] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923652 (10jcrespo) [18:10:47] RECOVERY - Swift HTTP frontend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:10:47] RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:10:47] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.011 second response time [18:11:03] RECOVERY - LVS HTTP IPv4 on ms-fe.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.010 second response time [18:11:04] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [18:11:08] nice [18:11:14] * apergos crosses fingers [18:11:17] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:11:17] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.001 second response time [18:11:17] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.009 second response time [18:11:17] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.009 second response time [18:11:17] RECOVERY - Swift HTTP backend on 
ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.008 second response time [18:11:37] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [18:11:41] <_joe_> apergos: we just shifted some traffic [18:11:50] yep been lurking [18:12:03] let's hope codfw handles it smoothly [18:13:17] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.032 second response time [18:13:17] RECOVERY - Nginx local proxy to apache on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.044 second response time [18:13:17] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 75216 bytes in 0.124 second response time [18:13:25] !log mw1205 - restarted hhvm [18:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:08] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2923982 (10Elisfkc) 05Open>03Resolved a:03Elisfkc Got upload wizard and Flickr2Common... [18:15:47] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [18:17:26] (03PS2) 10Andrew Bogott: Puppetmaster: Remove remnants of ldap node definitions. 
[puppet] - 10https://gerrit.wikimedia.org/r/330959 (https://phabricator.wikimedia.org/T148781) [18:19:17] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:22:06] (03PS2) 10Tim Landscheidt: Tools: Undo obsolete /var/mail customization [puppet] - 10https://gerrit.wikimedia.org/r/326306 [18:22:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:23:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:23:28] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [18:24:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:32:24] _joe_: Well, whatever you guys broke had a quite positive effect on videoscaler ‘load’, lol. [18:32:47] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:32:50] some uploads were failing for a while [18:33:03] probably helped things on the videoscalers [18:34:39] Krenair: Every queued transcode received a ‘transcode_time_error’ at about 201701061611 [18:34:40] (03PS4) 10Tim Landscheidt: Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) [18:35:23] they all timed out at the same time, when the issues started to occur? [18:36:06] Krenair: It appears to be the same behavior as when the scalers were rebooted last month.
[18:36:18] (in that they all got an error time) [18:36:29] https://quarry.wmflabs.org/query/14842 [18:36:52] (03CR) 10Andrew Bogott: [C: 032] Set SYS_UID_MAX and SYS_GID_MAX to 499 [puppet] - 10https://gerrit.wikimedia.org/r/326311 (https://phabricator.wikimedia.org/T45795) (owner: 10Tim Landscheidt) [18:39:02] (03PS2) 10Andrew Bogott: toollabs: remove host aliases for tools-exec-121[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/330333 (https://phabricator.wikimedia.org/T154539) (owner: 10BryanDavis) [18:39:06] Revent, does it record the error somewhere? [18:39:23] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924063 (10czar) Special:Upload is back up for me after being down for a while (T154790). I... [18:40:21] Krenair: There is an error field (that I am not displaying in that search) but it looks like they are all ‘source not found' [18:40:39] (03CR) 10Andrew Bogott: [C: 032] toollabs: remove host aliases for tools-exec-121[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/330333 (https://phabricator.wikimedia.org/T154539) (owner: 10BryanDavis) [18:41:24] strange error [18:41:32] https://quarry.wmflabs.org/query/15278 [18:41:44] You can see it there on a couple of the transcodes. [18:41:46] but it's possible this is just what happened when Swift was overwhelmed [18:42:41] Krenair: TBH, if it ‘just happened’ to choose today to crap out everything in the queue… that’s not necessarily a bad thing.
[18:43:22] https://phabricator.wikimedia.org/T154733 <- because of what this is meant to address [18:43:53] I doubt it's unrelated [18:48:54] (03PS22) 10Dzahn: Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:51:44] (03CR) 10Dzahn: [C: 032] Add support for searching gerrit using bug:T1 [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:52:09] Revent, do you need to re-queue stuff? [18:52:36] Krenair: Eventually, yeah, but kinda waiting to watch what’s happening. [18:52:44] yep [18:52:50] (I can’t see the ‘real’ queue) [18:52:53] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: 10Paladox) [18:52:58] I imagine there'll be a report at some point [18:53:27] Krenair: Can you check what the backlog in the ‘real’ queue looks like? [18:53:46] I don't really know enough about the video scaling systems [18:53:47] sorry [18:54:20] (03PS2) 10Tim Landscheidt: Tools: Update list of host aliases for mail relay [puppet] - 10https://gerrit.wikimedia.org/r/326308 [18:54:51] (nods) Apparently, the ‘status’ of the file in the DB as ‘queued’ is not what controls them actually being run, there is an actual software queue somewhere. [18:55:13] probably job queue jobs I would guess [18:56:48] "webVideoTranscode: 0 queued; 5878 claimed (4660 active, 1218 abandoned); 0 delayed" [18:56:59] bd808: Awesome! [18:57:17] that's from "$ mwscript showJobs.php --wiki=commonswiki --group" [18:57:38] It’s that ‘0 queued’, in that the pile of huge crap people kept resetting is flushed out. :) [18:59:02] Krenair: And yes, I’m qould to has to requeue a couple of thousand, but… not at a rate that will keep the scalers from being able to handle new uploads.
[18:59:12] *going to have to requeue [18:59:19] !log gerrit restarting for config change 308753 - will be back in seconds [18:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] yep [19:00:06] thanks Revent [19:00:33] Yeah, it’s my life recently. (lol) :P [19:01:09] hashar: hi, did you want this one? https://gerrit.wikimedia.org/r/#/c/328051/ [19:02:39] (03PS3) 10Dzahn: toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) [19:02:55] mutante: haven't come across that one yet :( [19:03:04] hashar: no rush, later then [19:03:49] mutante: it is rather nasty change with a race condition between upstart and mysql data being on a tmpfs [19:03:52] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924104 (10jcrespo) @czar There will be a post mortem, as usual, on https://wikitech.wikime... [19:06:50] hashar: yep, not now then. ack [19:07:21] !log gerrit: Started full reindex of all changes, should be background but will be watching [19:07:22] 06Operations, 10Traffic, 10media-storage: Cache and media (images) issues on all Wikimedia wikis - creates problems on upload, display and generation of thumbnails and files - https://phabricator.wikimedia.org/T154780#2924114 (10czar) That's great but I meant that as a largely non-technical user, I didn't kn...
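bd808's `showJobs.php --group` output quoted above is a flat summary line per job type; pulling the counters out of it for a quick check is a one-liner. A sketch over the exact line quoted (the sed patterns are mine, not a supported MediaWiki interface):

```shell
# Extract queue counters from the "showJobs.php --group" line quoted above.
LINE="webVideoTranscode: 0 queued; 5878 claimed (4660 active, 1218 abandoned); 0 delayed"
# Number before "queued" (after the job-type prefix).
QUEUED=$(printf '%s\n' "$LINE" | sed -n 's/^[^:]*: \([0-9][0-9]*\) queued.*/\1/p')
# Number before "abandoned" inside the parenthesized breakdown.
ABANDONED=$(printf '%s\n' "$LINE" | sed -n 's/.*, \([0-9][0-9]*\) abandoned.*/\1/p')
echo "queued=$QUEUED abandoned=$ABANDONED"   # → queued=0 abandoned=1218
```

The `0 queued` / `1218 abandoned` split is exactly what the conversation turns on: nothing waiting, but a pile of abandoned transcodes that will need requeueing.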
[19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:44] mutante: the rough idea is that when a machine boots, the mysql service should only start after a tmpfs has been created at /var/lib/mysql [19:07:51] so yeah need a bunch of testing [19:10:07] (03CR) 10Dzahn: [C: 032] toollabs/CI: give banner scripts an .sh extension [puppet] - 10https://gerrit.wikimedia.org/r/327673 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:10:37] greg-g: hi… i'm gonna need to have something deployed today real quick :( [19:10:39] hashar: yea, agree! [19:10:41] (03PS3) 10Andrew Bogott: Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 [19:11:46] (03CR) 10Andrew Bogott: [C: 032] Openstack: Forward some custom config changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/330626 (owner: 10Andrew Bogott) [19:12:11] (03CR) 10Hashar: [C: 04-1] "That is not going to fix the issue. The problem is we explicitly mark the mysql service to be started manually on boot, and whenever pupp" [puppet] - 10https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: 10Paladox) [19:12:28] greg-g: https://phabricator.wikimedia.org/T154779 / https://gerrit.wikimedia.org/r/330974 [19:12:30] (03PS4) 10Dzahn: base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 [19:12:39] (03PS2) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:13:06] (03Abandoned) 10Paladox: Contint: notify service mysql on creation of mysql dir [puppet] - 10https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450) (owner: 10Paladox) [19:13:21] (03CR) 10Dzahn: "please re-add me when it's ready to go" [puppet] - 10https://gerrit.wikimedia.org/r/326150 (https://phabricator.wikimedia.org/T152640) (owner: 10Paladox) [19:13:40] greg-g: we'd have noticed this problem yesterday if the train wasn't
delayed :( [19:15:03] (03CR) 10Dzahn: "please re-add me when it's time for this (needs the next scheduled maintenance window, right)" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [19:16:01] MatmaRex: looks like the sort of thing that should be ok for a Friday [19:16:08] MatmaRex: +2 [19:16:13] Start the backport process [19:16:22] thanks folks [19:17:02] !log restarting elasticsearch on relforge100[12] to test new search-ltr plugin [19:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:25] (03CR) 10Dzahn: [C: 032] base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 (owner: 10Dzahn) [19:18:52] ostriches: bd808: wmf.7 backport is https://gerrit.wikimedia.org/r/#/c/330975/ [19:19:25] robh: ^ you will now get "lshw" everywhere (again) [19:19:27] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 165 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 162, number_of_pending_tasks: 9, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 196, task_max_waiting_in_queue_millis: 1076, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [19:19:27] PROBLEM - ElasticSearch health check for shards on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 147 threshold =0.1% breach: status: red, number_of_nodes: 2, unassigned_shards: 144, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 214, task_max_waiting_in_queue_millis: 427, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: [19:19:35] mutante: huzzah! [19:20:05] ugh, and right when i said.. 
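The boot-ordering race hashar describes earlier (mysqld must not start before the tmpfs is mounted at /var/lib/mysql) is what the abandoned change 328051 tried to fix. The CI slaves in question ran upstart, but on a systemd host the ordering would be a one-line drop-in; a sketch only, with a hypothetical drop-in path:

```ini
# /etc/systemd/system/mysql.service.d/tmpfs-order.conf  (hypothetical drop-in)
[Unit]
# systemd generates var-lib-mysql.mount from the tmpfs entry in /etc/fstab;
# RequiresMountsFor adds Requires= and After= on that mount unit, so mysqld
# can only start once the tmpfs is actually mounted.
RequiresMountsFor=/var/lib/mysql
```

After a `systemctl daemon-reload`, a boot can no longer start mysqld against the bare, unmounted /var/lib/mysql directory.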
[19:20:27] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial [19:20:27] RECOVERY - ElasticSearch health check for shards on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 275, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 361, initial [19:20:32] ^this is just the restart, right? [19:20:34] (03PS1) 10Dzahn: Revert "base: add lshw to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/330976 [19:21:02] what the heck "E: Unable to locate package lswh [19:21:11] but manually it's totally there [19:21:18] and installs without any problems [19:21:37] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:21:47] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:21:47] mutante lshw [19:21:50] (03CR) 10Dzahn: [C: 032] Revert "base: add lshw to standard packages" [puppet] - 10https://gerrit.wikimedia.org/r/330976 (owner: 10Dzahn) [19:21:54] sigh [19:21:57] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:21:57] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:06] what a stupid typo [19:22:13] lol [19:22:16] what is that, BTW? [19:22:17] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh],Etcd_user[root] [19:22:28] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on db1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:28] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:29] PROBLEM - puppet last run on hassium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:22:29] PROBLEM - puppet last run on bohrium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:30] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:30] PROBLEM - puppet last run on wtp1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:31] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:36] oh, ls hw [19:22:46] I got confused with the other command [19:22:47] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:47] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:48] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:48] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[lswh] [19:22:48] jynus it was a typo [19:22:49] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:49] PROBLEM - puppet last run on ms-be1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:22:50] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[lswh] [19:23:05] jynus: yea, to list hardware details mostly for dc-ops [19:23:14] yes, I know it [19:23:32] it was the typo one that got me confused [19:24:05] yea :/ [19:24:32] i'm getting the bot back asap [19:24:40] just to avoid the spam [19:24:55] (03CR) 10Paladox: "probably want to create a follow up with the typo fixed :)" [puppet] - 10https://gerrit.wikimedia.org/r/330976 (owner: 10Dzahn) [19:25:13] (03CR) 10Andrew Bogott: [C: 032] Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 (owner: 10Andrew Bogott) [19:25:26] (03PS3) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:25:53] (03PS1) 10Dzahn: Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 [19:26:18] ostriches: you'll be deploying that, right? 
[19:26:39] I can ya [19:26:57] thanks [19:27:46] (03PS2) 10Dzahn: Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 [19:29:22] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/UploadWizard/resources/mw.UploadWizard.js: I32e0b8f81ca2a2e9ffc0c3a379921e12465815f2 (duration: 00m 59s) [19:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:02] (03CR) 10Dzahn: [C: 032] Revert "Revert "base: add lshw to standard packages"" [puppet] - 10https://gerrit.wikimedia.org/r/330979 (owner: 10Dzahn) [19:30:06] MatmaRex: All done [19:33:25] ostriches: hmm, i'm still seeing the old code [19:33:31] Cache? [19:33:51] probably not [19:33:52] Commit is showing on tin [19:33:56] "Ignore 'bad-prefix' warning on the Upload step" [19:34:00] ostriches: hmm, you synced only that one file? it's the wrong file :D [19:34:06] Blah [19:34:08] Whoops [19:34:15] resources/mw.UploadWizardUpload.js [19:34:49] Whole dir this time :p [19:35:14] !log demon@tin Synchronized php-1.29.0-wmf.7/extensions/UploadWizard/resources: I32e0b8f81ca2a2e9ffc0c3a379921e12465815f2 (duration: 00m 40s) [19:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:29] (03CR) 10Chad: [C: 031] "Then lets go ahead with this." [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [19:36:38] (03PS4) 10Andrew Bogott: Nova: turn off ec2 api [puppet] - 10https://gerrit.wikimedia.org/r/330838 [19:37:14] (03CR) 10Dzahn: "i dunno, something about www-data running ssh gives me a bad feeling here. 
adding Moritz for a second opinion" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [19:38:08] ostriches: ok, this is ridiculous, but i am still seeing the old code [19:38:16] i am looking at https://commons.wikimedia.org/w/extensions/UploadWizard/resources/mw.UploadWizardUpload.js [19:38:27] there's no 'bad-prefix' in this file [19:38:31] I see new [19:38:33] Local cache? [19:38:43] curl https://commons.wikimedia.org/w/extensions/UploadWizard/resources/mw.UploadWizardUpload.js | grep bad-prefix [19:39:05] case 'bad-prefix': [19:39:05] // we ignore these warnings, because the title is not our final title. [19:39:09] (03CR) 10Dzahn: "the puppet part looks fine, as long as running purgeUnusedProjects.php itself is harmless on the cluster" [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:39:10] ^ I see it [19:40:02] (03CR) 10Dzahn: "maybe change the commit message. now it says "enable" but doesn't actually enable." [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [19:40:18] ostriches: i don't. do you want to see the curl -I? [19:40:39] i'm in europe, so i could be seeing something else than you [19:40:56] oooh wait, there it is. finally [19:41:06] :) [19:41:30] (03CR) 10Dzahn: [C: 04-1] "i think we aren't ready for this yet, as it broke when we tried last time, and meanwhile we have rsynced repos over instead" [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [19:41:33] (03CR) 1020after4: [C: 031] "I agree that it isn't ideal, however, it's required in order for phabricator repository clustering. The reason this is desired is that clu" [puppet] - 10https://gerrit.wikimedia.org/r/324841 (owner: 1020after4) [19:41:33] so… we cache .js files for an extra five minutes?
grumble [19:41:36] ostriches: thanks <3 [19:42:32] yw [19:42:41] (03CR) 1020after4: [C: 04-1] "Yeah this is still waiting for me to deploy upstream changes and then it still needs some more thorough testing." [puppet] - 10https://gerrit.wikimedia.org/r/324851 (https://phabricator.wikimedia.org/T137928) (owner: 1020after4) [19:43:01] mutante: I've been periodically running the script manually from terbium, so it's well tested :) [19:46:08] kaldari: ok, fair enough :) doing! [19:46:15] (03CR) 10Dzahn: [C: 032] Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:46:29] (03PS7) 10Dzahn: Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) (owner: 10Kaldari) [19:47:03] mutante: Thanks! [19:47:38] it's also quite fast [19:48:10] (03PS1) 10Urbanecm: Enable import from cswiki to arbcom_cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330983 (https://phabricator.wikimedia.org/T154799) [19:49:43] kaldari: we need to add a line in Hiera to make sure it's only enabled in eqiad, and not both codfw and eqiad [19:50:05] eh... 
or that's what i thought [19:50:24] double checks [19:54:01] (03PS6) 10Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 [19:54:20] (03PS7) 10Aaron Schulz: Add DB "shard" column to logstash log entries for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330612 [19:54:51] (03PS1) 10Dzahn: mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 [19:54:59] kaldari: ^ it was missing that [19:55:48] (03PS2) 10Dzahn: mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 [19:56:24] (03CR) 10Dzahn: [C: 032] mediawiki: activate page assessments cron in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/330984 (owner: 10Dzahn) [19:56:55] Thanks! [19:56:57] PROBLEM - nova-api http on labnet1002 is CRITICAL: connect to address 10.64.20.25 and port 8774: Connection refused [19:59:47] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:59:52] kaldari: done, exists on terbium now (but not on wasat) [20:00:41] mutante: what's wasat? [20:00:53] (03PS5) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [20:00:55] (03PS1) 10Chad: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 [20:01:01] kaldari: the maintenance server in codfw. 
if the datacenters switch over that replaces terbium [20:01:15] in that case we'll be able to just make the switch in Hiera [20:01:28] that will deactivate the cron in eqiad, and activate it in codfw [20:01:39] got it [20:01:42] (same with all the other maint crons) [20:02:26] Krinkle: https://gerrit.wikimedia.org/r/#/c/330612/ [20:02:35] (03PS3) 10Chad: MWMultiversion cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309366 [20:02:42] (03PS3) 10Thcipriani: Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) [20:03:47] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:04:25] the puppet spam because the lshw typo is over. lunch break [20:04:57] (03CR) 10jerkins-bot: [V: 04-1] Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [20:06:45] (03CR) 10Chad: [C: 032] Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:07:16] (03Merged) 10jenkins-bot: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:07:37] (03CR) 10jenkins-bot: Remove w/MWVersion.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [20:09:07] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [20:09:43] (03PS4) 10Thcipriani: Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) [20:11:07] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.003 second response 
time [20:12:25] (03PS1) 10Andrew Bogott: Nova: Fix syntax mistake in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/330996 [20:12:27] (03CR) 10ArielGlenn: [C: 031] "The changes to runphpscriptlet lgtm though I have not tested them. It's a script I recently refactored away from but don't want to toss ju" [puppet] - 10https://gerrit.wikimedia.org/r/309366 (owner: 10Chad) [20:14:18] !log demon@tin Synchronized w: Dropping old entry point (duration: 00m 41s) [20:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:37] (03CR) 10Andrew Bogott: [C: 032] Nova: Fix syntax mistake in nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/330996 (owner: 10Andrew Bogott) [20:25:27] (03CR) 1020after4: [C: 031] Include hhvm fatals and exceptions in scap canary checks [puppet] - 10https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: 10Thcipriani) [20:25:57] RECOVERY - nova-api http on labnet1002 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.003 second response time [20:33:51] (03PS1) 10Chad: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 [20:34:09] (03PS2) 10Chad: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 [20:37:19] https://commons.wikimedia.org/wiki/File:Ka%C5%A1elj_-_Rudnik_via_Orle.webm <- this is why the video scalers were breaking, lol (an example) [20:39:57] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures.
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [20:42:11] 06Operations, 10Traffic: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924351 (10ema) [20:42:16] 06Operations, 10Traffic: Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801#2924367 (10ema) p:05Triage>03Normal [20:45:27] RECOVERY - nova-api http on labtestnet2001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.083 second response time [20:46:20] (03PS4) 10Tim Landscheidt: mwyaml: Accept existing, but empty "Hiera:" pages as well [puppet] - 10https://gerrit.wikimedia.org/r/325131 (https://phabricator.wikimedia.org/T152142) [20:46:35] (03CR) 1020after4: [C: 031] Deploy scholarships with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/326461 (https://phabricator.wikimedia.org/T129134) (owner: 10Niharika29) [21:03:47] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:07:57] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:28:21] (03PS1) 10Andrew Bogott: Shinkengen: Get project hosts from openstack and not from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/331005 (https://phabricator.wikimedia.org/T108625) [21:29:11] (03CR) 10jerkins-bot: [V: 04-1] Shinkengen: Get project hosts from openstack and not from ldap. [puppet] - 10https://gerrit.wikimedia.org/r/331005 (https://phabricator.wikimedia.org/T108625) (owner: 10Andrew Bogott) [22:09:57] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:12:00] (03CR) 10Krinkle: [C: 031] Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:13:36] (03CR) 10Krinkle: "Perhaps keep MWVersion.php/getMediaWiki as one-line wrapper for at least one commit so that we can do a full Git search for any mentions a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [22:15:49] hey, ever tried changing the theme in gerrit diff ? [22:16:18] (03CR) 10Krinkle: "Looks like this is included from files like wiki.phtml using a relative ./ include, right?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [22:16:26] while looking at a diff in gerrit, upper right corner, gear icon, "Diff preferences" [22:17:03] (03CR) 10Krinkle: "nvm, git search has an outdated index :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330986 (owner: 10Chad) [22:19:16] (03CR) 10Krinkle: [C: 031] docroots: Swap wikidata for wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/330709 (owner: 10Chad) [22:20:19] (03PS2) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:25:15] (03PS3) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:27:49] (03CR) 10Umherirrender: [C: 031] "the only one to change at the moment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328482 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [22:28:09] (03PS4) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:30:12] (03PS5) 10Dzahn: tendril: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) [22:35:43] Krinkle: You're right 
re: leaving a stub for a bit in MWVersion before deleting. [22:35:48] Don't think I'll rush it out on a friday though [22:36:11] (03CR) 10Jcrespo: [C: 031] "I have not checked the code, but ok with the idea. Please double check authentication after apache changes." [puppet] - 10https://gerrit.wikimedia.org/r/330829 (https://phabricator.wikimedia.org/T133717) (owner: 10Dzahn) [22:36:13] (03CR) 10Chad: [C: 032] Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:36:50] (03Merged) 10jenkins-bot: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:37:52] (03CR) 10jenkins-bot: Fold updateBranchPointers into updateWikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/330998 (owner: 10Chad) [22:38:19] !log demon@tin Synchronized multiversion: updateBranchPointers consolidation (duration: 00m 56s) [22:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:57] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:42:16] (03PS6) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [22:42:28] (03PS1) 10Filippo Giunchedi: Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 [22:42:31] (03PS1) 10Dzahn: ganglia: use Letsencrypt for SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/331085 (https://phabricator.wikimedia.org/T133717) [22:43:08] (03PS7) 10Paladox: Gerrit: Add support for logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) [22:43:16] (03PS8) 10Paladox: Gerrit: Add support for logstash in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) 
[22:43:27] (03CR) 10Paladox: "> maybe change the commit message. now it says "enable" but doesn't" [puppet] - 10https://gerrit.wikimedia.org/r/330832 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [22:43:47] (03CR) 10Chad: "PS6 keeps MWVersion as a light wrapper for now so we can do a last check :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [22:44:09] (03CR) 10Paladox: "Ok. :)" [puppet] - 10https://gerrit.wikimedia.org/r/330455 (https://phabricator.wikimedia.org/T145885) (owner: 10Paladox) [22:46:07] 06Operations, 10DBA, 10Gerrit, 13Patch-For-Review, 07Upstream: Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#2924787 (10Paladox) @jcrespo or @Marostegui Hi, i have tested this https://gerrit.wikimedia.org/r/#/c/330455/ locally. Could you... [22:54:04] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2924820 (10Deskana) [22:54:06] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2924819 (10Deskana) 05Open>03Resolved [22:56:45] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2344551 (10Paladox) Was the elastic search plugin updated on the install? Might have been because... 
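The Gerrit unicode bug referenced above (T145885, "Gerrit shows HTTP 500 error when pasting extended unicode characters") is tagged for the DBAs, which suggests the classic MySQL "utf8" limitation: characters outside the Basic Multilingual Plane need four bytes in UTF-8, while MySQL's legacy 3-byte "utf8" charset cannot store them unless the column uses utf8mb4. That diagnosis is an inference from the task tags, not stated in this log; the encoding sizes themselves can be checked directly:

```python
# UTF-8 byte lengths per character. MySQL's legacy "utf8" charset
# stores at most 3 bytes per character, so 4-byte characters such
# as emoji are rejected unless the schema uses utf8mb4 instead.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):  # a, é, €, 😀
    print(repr(ch), len(ch.encode("utf-8")))
# the emoji (U+1F600) encodes to 4 bytes; the others to 1, 2, 3
```

Any input containing a 4-byte character would thus fail on a 3-byte "utf8" column, which matches an HTTP 500 on paste.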
[22:57:36] (03PS6) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [22:57:43] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:06:13] (03CR) 10BBlack: [C: 031] Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 (owner: 10Filippo Giunchedi) [23:07:10] (03CR) 10Filippo Giunchedi: [C: 032] Revert "swift: move thumbs handling to codfw temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/331084 (owner: 10Filippo Giunchedi) [23:08:11] !log force puppet run on cache_upload in eqiad to switch thumbs back from codfw [23:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:37] Hi guys, quick question. [23:23:07] And ideas as to who's responsible for handling domain transfers on the WMF side? [23:23:32] Also who's responsible for handling that on MarkMonitor side? Is it still Doni Daggett? [23:24:48] (03CR) 10Hashar: "Note that the experimental job runs on a Trusty machine." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:25:30] mutante: any ideas? [23:25:44] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:26:13] odder: best bet is to fill a task for #operations and #dns ? :) [23:26:39] (03CR) 10Krinkle: "It's on nodepool/jessie now. submodules are expanded. vendor non-dev is preserved. Uses same fetch-composer-dev logic as we do for mediawi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:27:37] hashar: sure, just managed to get it from a squatter so \o/! victory! 
[23:29:21] 06Operations, 10DNS, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2924998 (10tomasz) [23:29:23] (03CR) 10Hashar: [C: 031] "That is quite nice! Well done Timo :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:29:26] https://phabricator.wikimedia.org/T154826 hashar [23:29:59] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2925024 (10hashar) [23:30:00] odder: I cant remember all the details [23:30:38] odder: but I think WMF avoid buying every possible domains because it has a cost (the registration fee) and is a lot of operational work (dns entries, web park/redirect) etc [23:30:44] It used to be Yana / Doni Daggett from MarkMonitor when I last bought & donated a domain to the WMF. [23:30:59] hashar: it's a TLD that got squatted in 200x. [23:31:21] I have been meaning to buy for years now; they only forgot to renew it this year. [23:31:24] I am just mumbling hearsays :D [23:31:50] ops or legal would be able to tell eventually [23:32:10] Well I did pay for it for the first year so that's done :) [23:32:23] \O/ [23:32:49] (03PS1) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:34:06] odder: there is wikipedia.pl as well registered to Stowarzyszenie Wikimedia Polska. Maybe that is the local chapter [23:35:15] They are the local chapter, however they do not wish to buy domain names for WMF trademarks which are not the Polish names of their respective projects. [23:35:39] So Wiktionary is called "Wikisłownik" in Polish and they didn't want to buy the English trademark as it's not a Polish name of the project. [23:36:04] Same with Wikiquote.pl which I also snatched from a squatter and donated to WMF in 2015 (?) I think. 
[23:36:16] that kind of makes sense yeah [23:36:29] We're almost there, just one squatted domain to go. [23:36:44] I've been on a mission to catch them all for a while :) [23:37:02] when you get the last polish domain [23:37:14] I guess we should have a party :D [23:37:20] Expires 2017-04-23, but they're damn professionals, so we'll see. [23:37:34] Or maybe I'll just punch them in the face when I walk past their offices, we'll see. [23:37:53] they've held that one since 2008 :-( [23:38:08] well they will disappear long before Wikimedia does :] [23:38:13] time is on our side! [23:38:31] on those good words, I am heading to bed. Have a good week-end! [23:38:40] You too; thanks! [23:38:46] (03CR) 10Jcrespo: "I had some of this already implemented; in particular the json, which unlike PHP, has not item order for hashes, so the master requires ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 (owner: 10Krinkle) [23:39:11] (03PS2) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:40:45] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2924998 (10CRoslof) Send me an email and I can help you get the process started: croslof@wikimedia.org Please note that a domain generally cannot be transferred until 60 days after...
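Jcrespo's review comment above touches a real pitfall for the db.php JSON output: unlike PHP associative arrays, JSON objects carry no defined member order, so two serializations with differently ordered keys parse to equal values and a consumer must not rely on key position. A minimal Python illustration (the shard/host names here are made up, not the real db.php data):

```python
import json

# Two serializations of the same object, members in different order.
a = json.loads('{"s1": "db1052", "s2": "db1018"}')
b = json.loads('{"s2": "db1018", "s1": "db1052"}')

# JSON objects are unordered: both parse to equal mappings, so any
# consumer that cares about order (e.g. "the master is first") needs
# an explicit field or a list instead of relying on key position.
print(a == b)
```

This is why, as the comment hints, the master would need to be marked explicitly rather than inferred from its position in the hash.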
[23:41:09] 06Operations, 10DNS, 10Domains, 10Traffic: Donate wiktionary.pl to the Foundation - https://phabricator.wikimedia.org/T154826#2925039 (10CRoslof) a:03CRoslof [23:45:00] (03PS3) 10Krinkle: noc: Implement noc.wikimedia.org/db.php?format=json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331091 [23:45:57] (03PS7) 10Krinkle: build: require-dev phpunit in composer.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/325055 (https://phabricator.wikimedia.org/T85947) [23:48:27] (03PS1) 10Krinkle: build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) [23:48:36] (03CR) 10Krinkle: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:49:13] (03CR) 10jerkins-bot: [V: 04-1] build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:49:17] (03CR) 10jerkins-bot: [V: 04-1] build: Update PHPUnit from 3.7 to 4.8, add phplint to composer-test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/331093 (https://phabricator.wikimedia.org/T85947) (owner: 10Krinkle) [23:57:27] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues