[00:00:39] (03CR) 10Paladox: "Applied at http://gerrit-test.wmflabs.org/gerrit/#/c/32/2/tests/fixtures/layout-cloner.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:06:23] !log restbase deploy ae5fbac to staging
[00:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:11:48] (03PS2) 10Paladox: gerrit: Fix the css for inline diff [puppet] - 10https://gerrit.wikimedia.org/r/301027
[00:23:21] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2494394 (10aaron) 05Open>03Resolved
[00:26:11] (03CR) 10Dzahn: [C: 032] "yes thanks! i noticed that wrapping issue too. that should be https://bugs.chromium.org/p/gerrit/issues/detail?id=4292" [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:26:41] mutante: We still gotta merge my rsync thingie.
[00:26:51] If we want that to apply too :)
[00:28:05] :)
[00:28:16] (03CR) 10Paladox: "thanks." [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:28:29] (03PS3) 10Dzahn: Rsyncd: Allow ensure => absent on config files [puppet] - 10https://gerrit.wikimedia.org/r/300935 (owner: 10Chad)
[00:28:45] ostriches: yes, i meant to do that earlier, just got back later
[00:28:51] k :)
[00:29:05] (03CR) 10Dzahn: [C: 032] Rsyncd: Allow ensure => absent on config files [puppet] - 10https://gerrit.wikimedia.org/r/300935 (owner: 10Chad)
[00:30:43] ok, expecting a recovery and line wraps in a moment
[00:30:49] thank you
[00:30:50] :)
[00:31:35] it just removed the ferm rule for rsync
[00:31:47] here is the example link we had
[00:31:49] } elseif ( $action['type'] == 'unknown-signed-addition' ) {
[00:31:55] eh, wrong paste
[00:32:02] https://gerrit.wikimedia.org/r/#/c/301033/2/tests/fixtures/layout-cloner.yaml
[00:32:14] this shows how it wraps now instead of cutting the line off
[00:32:35] :)
[00:32:56] (03PS1) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:33:32] paladox: thanks
[00:33:42] you're welcome :)
[00:34:03] (03CR) 10jenkins-bot: [V: 04-1] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (owner: 10Yuvipanda)
[00:34:21] (03PS2) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:36:00] (03CR) 10jenkins-bot: [V: 04-1] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (owner: 10Yuvipanda)
[00:37:06] urandom: do you want 2008-c enabled now
[00:37:15] i might be too late
[00:37:33] (03PS4) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[00:38:03] mutante: uh, yeah, if you want
[00:38:31] (03PS2) 10Dzahn: Enable Cassandra instance restbase2008-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300942 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans)
[00:38:52] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase2008-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300942 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans)
[00:39:20] (03PS3) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:40:29] urandom: you can go ahead now
[00:40:40] mutante: great, thanks!
[00:41:00] yw
[00:44:43] PROBLEM - MariaDB Slave Lag: s2 on db1036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1125.73 seconds
[00:45:00] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 808.86 seconds
[00:45:15] (03CR) 10Dzahn: "you can just allow "/bin/journalctl *" like we do in other admin groups. the wildcard in the middle of it is not going to mean it's limite" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey)
[00:45:17] (03PS5) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[00:48:58] !log T134016: Bootstrapping restbase2008-c.codfw.wmnet
[00:49:00] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016
[00:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:49:09] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[00:49:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5090989 keys - replication_delay is 605
[00:52:54] RECOVERY - MariaDB Slave Lag: s2 on db1036 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[00:53:37] !log lead - stopped rsyncd
[00:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:55:29] bblack: So in regards to https://gerrit.wikimedia.org/r/#/c/296634/ - I'm not really sure what the procedures are for getting changes to varnish deployed. Should I sign that patch up for puppet swat? or do something else?
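The [00:45:15] review comment above is about sudo command matching for admin groups: a trailing wildcard already grants every journalctl invocation, and a wildcard placed mid-command does not narrow it, because sudoers wildcards also match whitespace. A minimal sudoers sketch of that point, using the eventbus-admins group name from change 300860; the real rules live in the puppet admin module and are not shown in the log:

    # trailing wildcard: allows any journalctl invocation
    %eventbus-admins ALL = NOPASSWD: /bin/journalctl *
    # mid-command wildcard: still matches arbitrary trailing arguments
    # (sudoers '*' matches spaces too), so it is not actually narrower
    %eventbus-admins ALL = NOPASSWD: /bin/journalctl -u eventlogging*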
[00:58:36] ACKNOWLEDGEMENT - swift-account-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-reaper on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-container-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor daniel_zahn known maintenance
[00:58:44] oops, not intended to kill it
[00:58:55] but will be back
[01:00:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5087128 keys - replication_delay is 0
[01:03:24] ACKNOWLEDGEMENT - Check size of conntrack table on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - DPKG on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - Disk space on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - MD RAID on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - NTP on ganeti1004 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - configured eth on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - dhclient process on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:25] ACKNOWLEDGEMENT - ganeti-confd running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:25] ACKNOWLEDGEMENT - ganeti-mond running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:26] ACKNOWLEDGEMENT - ganeti-noded running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:26] ACKNOWLEDGEMENT - puppet last run on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:27] ACKNOWLEDGEMENT - salt-minion processes on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:04:22] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2399490 (10Dzahn) ganeti1004 showed up in Icinga. expired downtime. ACKed
[01:05:21] (03PS1) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:06:24] (03PS2) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:07:28] (03CR) 10Chad: [C: 031] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:08:10] PROBLEM - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: Connection refused
[01:10:25] (03PS3) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:10:51] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: Connection refused daniel_zahn bootstrapping T134016
[01:11:46] !log deploying from 2d9817b to a291da1 for ores in scb nodes
[01:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:14:05] deployed in canary
[01:14:08] it was okay
[01:21:03] okay, everything was fine and it's still fine
[01:21:07] I call it a victory
[01:21:47] i heard the new thing is that there is swift in labs
[01:21:52] oops, wrong channel
[01:22:01] Amir1: :)
[01:22:21] :)
[01:23:39] hey mutante, we have this new dashboard on the jobs the ORES extension makes to the service, and how many of them fail: https://grafana.wikimedia.org/dashboard/db/ores-extension I thought you might find this interesting :)
[01:24:20] (zoom out, you will see funny things :D)
[01:28:48] Amir1: that doesn't look so bad. if i zoom out far enough just a spike in the beginning
[01:28:58] 4% failure rate recently ?
[01:29:27] it's 1% since the last week and deployment of a new config for wikidata
[01:29:40] (03PS4) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:29:52] Amir1: :)
[01:29:53] the first three spikes were when the model had issues with Dutch Wikipedia
[01:30:01] 400 errors per minute
[01:30:53] (03PS5) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:31:22] the funny thing is that the ores extension retries failed jobs 30 times, and in some cases ores never gives a score (e.g. edits on talk pages in wikidata). I was thinking ORES is an AI, so it'll get upset and throw some scores after the 20th time :D
[01:33:33] (03CR) 10Alex Monk: [C: 031] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:33:55] hehe, gotta make it learn about wikidata talk pages
[01:36:29] PROBLEM - restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.147, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:37:00] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused
[01:38:59] ^ ignoring that because it's called a test host
[01:39:03] shrug
[01:39:39] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused
[01:40:28] i think a test host should probably not be in monitoring
[01:41:23] (03PS4) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063)
[01:41:29] (03PS5) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063)
[01:41:42] (03CR) 10Dereckson: [C: 031] "Looks good to me." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio)
[01:42:40] (03PS1) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045
[01:43:56] (03CR) 10Yuvipanda: [C: 032] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:44:11] (03CR) 10Yuvipanda: "Killed with https://gerrit.wikimedia.org/r/#/c/301036/" [puppet] - 10https://gerrit.wikimedia.org/r/296809 (owner: 10Chad)
[01:44:59] (03CR) 10Yuvipanda: [C: 032] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:45:09] (03PS6) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:45:10] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:45:14] (03CR) 10Yuvipanda: [V: 032] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:49:16] (03PS2) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449)
[01:49:37] (03CR) 10Dzahn: [C: 032] "the other 2 users from the same request are already done. same batch, just completing this" [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449) (owner: 10Dzahn)
[01:50:22] (03PS3) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449)
[01:50:32] sniped again
[01:50:49] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[01:51:20] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.014 second response time
[01:51:20] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494534 (10Dzahn) a:03Dzahn
[01:51:37] !log cerium testing is over?
[01:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:53:37] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2494539 (10Dzahn) a:03Dzahn
[01:53:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494540 (10Dzahn) a:05Jgreen>03Dzahn
[01:53:41] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2494542 (10Dzahn) a:05Jgreen>03Dzahn
[01:55:30] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 4 failures
[01:55:40] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494547 (10Dzahn)
[01:55:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494545 (10Dzahn) 05Open>03Resolved user has been created on bastions, stat1002 (and elsewhere wher...
[01:55:50] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494548 (10Dzahn)
[01:56:50] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10Dzahn)
[01:56:52] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2494549 (10Dzahn) 05Open>03Resolved user has been created on bastion hosts, stat1002 and other places where the group is u...
[02:00:45] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494555 (10Dzahn)
[02:00:47] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2494553 (10Dzahn) 05Open>03Resolved user has been created on bastion hosts, stat1002 and other places where t...
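The access-request tasks above are resolved by adding the users to puppet-managed POSIX groups; a quick way to verify the result after a puppet run on the target host, using the username and group names from T140449 (the exact output is elided):

    ssh stat1002.eqiad.wmnet id bcohn
    # expect analytics-privatedata-users and researchers in the groups list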
[02:02:06] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10Dzahn) 05Open>03Resolved all 3 users have been created now. details in subtasks
[02:02:31] (03PS1) 10Yuvipanda: ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046
[02:03:49] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active
[02:04:00] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:04:20] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.005 second response time on port 9042
[02:05:51] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494563 (10Dzahn) @ellery all 3 users have been created now. please follow-up with them so they get to the data they need ( --> T140399#2483375)
[02:10:10] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[02:12:10] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[02:17:27] (03PS1) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:19:04] (03CR) 10jenkins-bot: [V: 04-1] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (owner: 10Yuvipanda)
[02:19:09] (03PS2) 10Yuvipanda: ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046
[02:19:16] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046 (owner: 10Yuvipanda)
[02:21:02] (03PS1) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:23:44] jouncebot: next
[02:23:45] In 12 hour(s) and 36 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500)
[02:24:04] (03CR) 10jenkins-bot: [V: 04-1] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:24:20] (03PS2) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:24:22] (03PS2) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:24:40] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail
[02:25:35] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 09m 00s)
[02:25:36] (03CR) 10jenkins-bot: [V: 04-1] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:58] (03CR) 10jenkins-bot: [V: 04-1] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (owner: 10Yuvipanda)
[02:26:15] (03PS3) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:26:19] (03PS3) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:27:07] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:30:50] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:31:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 26 02:31:43 UTC 2016 (duration 6m 8s)
[02:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:58] (03PS1) 10Yuvipanda: ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050
[02:36:35] (03PS4) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:42:11] (03PS5) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:42:13] (03PS2) 10Yuvipanda: ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050
[02:42:15] (03PS1) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052
[02:42:57] (03CR) 10Yuvipanda: [C: 032] ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050 (owner: 10Yuvipanda)
[02:47:11] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail
[02:50:59] (03PS2) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063)
[02:51:02] (03PS6) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063)
[02:52:55] (03PS1) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063)
[03:10:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 22 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:15:20] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:16:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 17 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:26:49] (03PS1) 10Yuvipanda: WIP replacement of modify-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/301058
[03:31:19] (03PS1) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063)
[03:39:25] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[03:41:15] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[03:43:14] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[03:47:14] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[03:57:37] (03PS1) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063)
[04:27:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:32:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:34:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:37:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:51:28] (03PS1) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[04:52:36] (03CR) 10jenkins-bot: [V: 04-1] Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[04:53:52] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[05:00:01] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[05:13:34] mmmm kafka2002 looks weird
[05:17:27] ah issues with kafka 0.9 after the upgrade
[05:19:37] very weird, will keep an eye on it
[05:21:32] (03PS2) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[05:30:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:36:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:45:29] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2494691 (10Dzahn)
[05:46:01] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10Dzahn)
[05:46:03] 06Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2494692 (10Dzahn) 05Open>03Invalid oh, thanks @akosiaris for all the details
[05:47:50] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2494695 (10Dzahn) @dschwen Is it fixed now?
[05:54:06] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2494696 (10Dzahn) @andrew @yuvipanda as list admins of labs-l and labs-announce, how about the remaining checkbox " Add as mod to labs-l/labs-announce"...
[05:57:13] (03CR) 10Ori.livneh: [C: 031] Add Html5Depurate module and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[06:10:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:13:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: replace gerrit server (ytterbium) with jessie server (lead) - https://phabricator.wikimedia.org/T125018#2494697 (10Dzahn) decom: https://gerrit.wikimedia.org/r/#/c/300806/ https://gerrit.wikimedia.org/r/#/c/300812/ 01:36 ostriches:...
[06:16:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:18:08] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494702 (10Dzahn) Alright, found the shell name associated with the wikitech user is "jk" using ldapsearch. Added jk to the LDAP group called "nda". ldapsear...
[06:20:20] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494703 (10Dzahn) 05Open>03Resolved the grafana-admin part of this request should be resolved now. i am not sure about the WebRequestLogs , @gehel do you...
[06:20:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494705 (10Dzahn) 05Resolved>03Open
[06:32:20] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:10] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:20] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[06:37:58] <_joe_> elukey: you're already on it right?
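The flapping check right above probes eventbus itself: it POSTs a test event to /v1/events and expects HTTP 201, so a 500 means the service accepted the request but failed to produce to Kafka. A rough manual version of that probe; the port number and the event body are assumptions, only the path and the 201/500 contract appear in the log:

    # prints the HTTP status code: 201 = accepted, 500 = produce failed
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Content-Type: application/json' \
        -d '[{"meta": {"topic": "test.event", "schema_uri": "test/event/1", "uri": "/test/event", "id": "00000000-0000-0000-0000-000000000001", "dt": "2016-07-26T06:37:00Z", "domain": "wikimedia.org"}}]' \
        http://kafka2002.codfw.wmnet:8085/v1/events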
[06:38:13] <_joe_> (eventbus)
[06:39:20] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[06:42:34] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2494722 (10Joe)
[06:42:36] hey _joe_ sorry I was commuting, will double check in a bit
[06:42:50] <_joe_> elukey: it recovered in the meantime
[06:42:51] I am sure it is due to the kafka 0.9 migration
[06:43:00] it seems to have been flapping since this morning :/
[06:51:01] so kafka seems ok
[06:51:23] but eventbus has some issues pushing messages to it
[06:51:41] those are only test events since this is main-codfw
[06:52:00] so it might be related to the python kafka client that doesn't like kafka 0.9
[06:52:18] ah yes: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[06:52:19] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2494729 (10Joe)
[06:56:30] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:12:28] (03PS3) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[07:12:48] (03PS1) 10Giuseppe Lavagetto: kafka_config: sort arrays as well [puppet] - 10https://gerrit.wikimedia.org/r/301070 (https://phabricator.wikimedia.org/T141242)
[07:12:50] (03PS1) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[07:14:19] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[07:16:08] (03CR) 10Tim Starling: Add Html5Depurate module and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[07:20:04] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2494769 (10MoritzMuehlenhoff) >>! In T140422#2493224, @madhuvishy wrote: > @MoritzMuehlenhoff Done! http://keys.gnupg.net/pks/lookup?op=get&search=0xA4D...
[07:38:46] !log Update cxserver to 447a6c9
[07:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:48:55] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10Jan_Dittrich)
[07:50:09] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:52:09] (03CR) 10Muehlenhoff: [C: 031] "Looks fine now" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[07:53:28] (03CR) 10Muehlenhoff: "But maybe drop the notes on labtest and the corp mirror which have been clarified on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[07:56:09] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:02:07] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10elukey) Hi Jan, adding https://wikitech.wikimedia.org/wiki/Analytics/Data_access to this task as reference but you have probably already seen it. As far as I can...
[08:08:02] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 155574 MB (3% inode=99%)
[08:09:54] (03PS1) 10Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337)
[08:10:32] (03CR) 10Gilles: [C: 04-1] "python-thumbor-wikimedia 1.0.5 needs to be packaged and uploaded to our repo first" [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles)
[08:11:26] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2494904 (10Gilles)
[08:11:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2437365 (10Gilles)
[08:14:24] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2494922 (10Gilles)
[08:15:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:16:59] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2494926 (10MoritzMuehlenhoff) >>! In T141277#2493220, @chasemp wrote: > @andrew and I were discussing whether LVS would make sense in front of LDAP with the ability to more in...
[08:21:41] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (10Joe) If what happened the other day was that one VM was overloaded and stopped answering ldap queries while still accepting connections (which given the described p...
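The failure mode @Joe describes in T141277 just above (an overloaded server that still accepts TCP connections but never answers) is exactly the case a plain port check misses: a useful liveness probe has to complete a real query under a short timeout. A sketch, with a placeholder host name rather than one of the real servers:

    # read the rootDSE with a 3-second network timeout and time limit;
    # a hung-but-accepting server fails this, while a bare TCP connect
    # check would still report it as healthy
    ldapsearch -x -H ldap://ldap-server.example.wmnet \
        -o nettimeout=3 -l 3 -b '' -s base '(objectClass=*)' namingContexts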
[08:21:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:22:01] RECOVERY - Disk space on fluorine is OK: DISK OK
[08:25:30] (03CR) 10Muehlenhoff: Minor tweaks to 2.12.2 package (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad)
[08:38:14] (03CR) 10Hashar: remove ytterbium from puppet, update gerrit comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300806 (owner: 10Dzahn)
[08:43:02] (03PS2) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[08:44:07] (03CR) 10Giuseppe Lavagetto: [C: 032] kafka_config: sort arrays as well [puppet] - 10https://gerrit.wikimedia.org/r/301070 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[08:44:24] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[08:48:52] (03PS2) 10Giuseppe Lavagetto: Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis)
[08:54:39] (03PS1) 10Jcrespo: Delete coredb_mysql module [puppet] - 10https://gerrit.wikimedia.org/r/301076
[08:54:42] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2494999 (10dschwen) 05Open>03Resolved Yes! Many thanks!
[08:57:29] (03CR) 10Jcrespo: "This is technical debt; this module had been deprecated by the mariadb one, but some hosts continued using it until last week." [puppet] - 10https://gerrit.wikimedia.org/r/301076 (owner: 10Jcrespo)
[08:57:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Here you are modifying the default file we use via" [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis)
[09:04:12] (03PS3) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:05:55] (03PS4) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:08:33] (03PS5) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:10:33] (03CR) 10Jcrespo: [C: 04-2] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo)
[09:10:43] (03CR) 10Jcrespo: [C: 032] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo)
[09:10:54] ^I clicked the wrong button
[09:11:11] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[09:11:32] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495057 (10fgiunchedi) @elukey sounds great, thanks! mw1170 and mw1171 would do I think?
[09:12:22] (03PS2) 10Filippo Giunchedi: rsyslog: temporarily lower centralserver retention [puppet] - 10https://gerrit.wikimedia.org/r/300833 (https://phabricator.wikimedia.org/T139612)
[09:18:39] !log updating debhelper, cdbs, devscripts, libintl-perl, libmodule-build-perl and libnet-dns-perl on jessie systems for compatibility with perl security update
[09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:21:50] (03CR) 10Filippo Giunchedi: [C: 032] rsyslog: temporarily lower centralserver retention [puppet] - 10https://gerrit.wikimedia.org/r/300833 (https://phabricator.wikimedia.org/T139612) (owner: 10Filippo Giunchedi)
[09:29:41] (03PS3) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[09:30:18] (03PS1) 10Jcrespo: Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081
[09:31:04] (03PS2) 10Jcrespo: Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081
[09:32:20] (03CR) 10Jcrespo: [C: 032] Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081 (owner: 10Jcrespo)
[09:33:21] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:39:04] (03PS1) 10Jcrespo: Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082
[09:39:15] (03PS2) 10Jcrespo: Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082
[09:39:42] RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:39:42] RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:39:52] RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:39:52] RECOVERY - swift-object-server on ms-be3001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:40:04] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495090 (10Addshore) >>! In T141339#2494880, @elukey wrote: > Hi Jan, > > adding https://wikitech.wikimedia.org/wiki/Analytics/Data_access to this task as reference but you...
[09:40:13] RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:40:23] RECOVERY - swift-object-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:40:53] RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:41:02] RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:41:03] RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:41:23] RECOVERY - swift-account-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:41:23] RECOVERY - swift-object-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:41:26] hello ms-be3001
[09:42:08] (03PS1) 10Elukey: Add the permissions_validity_in_ms among the configurable parameters [puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869)
[09:42:16] :( sorry about that
[09:42:26] the recoveries will notify anyways
[09:43:03] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:43:22] RECOVERY - swift-container-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:43:22] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:43:31] RECOVERY - swift-account-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:43:31] RECOVERY - swift-container-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:43:42] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:43:42] RECOVERY - swift-account-reaper on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:43:42] RECOVERY - swift-object-auditor on ms-be3004 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:43:42] RECOVERY - swift-object-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:43:43] RECOVERY - swift-container-updater on ms-be3004 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:43:43] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:43:52] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:43:52] RECOVERY - swift-account-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:44:09] RECOVERIES == good
[09:44:18] happy people
[09:44:21] RECOVERY - swift-account-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:44:51] (03CR) 10Jcrespo: [C: 032] Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082 (owner: 10Jcrespo)
[09:45:03] RECOVERY - swift-object-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:45:03] RECOVERY - swift-account-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:45:04] RECOVERY - swift-container-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:45:21] RECOVERY - swift-object-server on ms-be3002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:46:12] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: puppet fail
[09:52:43] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2495105 (10Addshore) >>! In T140911#2494703, @Dzahn wrote: > the grafana-admin part of this request should be resolved now. i am not sure about the WebReques...
[09:54:57] (03CR) 10Muehlenhoff: "Actually after revisiting the Grafana dashboard, the current memory consumption rather depletes approx every 7-9 days (possibly be increas" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[09:56:54] (03PS1) 10Jcrespo: Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085
[09:57:39] (03CR) 10jenkins-bot: [V: 04-1] Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085 (owner: 10Jcrespo)
[09:58:27] (03PS2) 10Filippo Giunchedi: site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646)
[10:00:26] (03PS2) 10Jcrespo: Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085
[10:01:38] (03CR) 10Jcrespo: [C: 032] Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085 (owner: 10Jcrespo)
[10:02:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As already stated, I would rather use something like a hourly cron script that restarts openldap when the used memory reaches N % of the t" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:03:35] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: puppet fail
[10:04:55] (03PS1) 10Jcrespo: More the template to the right subdir [puppet] - 10https://gerrit.wikimedia.org/r/301088
[10:05:11] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495149 (10elukey) @fgiunchedi looks good to me, the CPU utilization across the cluster seems very good and these hosts don't seem to have special properties...
[10:06:26] (03CR) 10Jcrespo: [C: 032] More the template to the right subdir [puppet] - 10https://gerrit.wikimedia.org/r/301088 (owner: 10Jcrespo)
[10:08:58] (03PS1) 10ArielGlenn: add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090
[10:11:26] (03CR) 10Alexandros Kosiaris: "Regarding the comment in the commit message about corp LDAP being hit by this, funnily enough, no, the corp replica is not hit by the issu" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:12:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The graph in https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen&from=1468702800000&to=1469307599999&var-server=" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:15:08] (03PS2) 10ArielGlenn: add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090
[10:16:23] (03PS1) 10Jcrespo: Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092
[10:16:26] <_joe_> win 25
[10:16:59] (03CR) 10ArielGlenn: [C: 032] add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090 (owner: 10ArielGlenn)
[10:18:05] (03PS2) 10Jcrespo: Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092
[10:19:32] (03CR) 10Jcrespo: [C: 032] Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092 (owner: 10Jcrespo)
[10:21:35] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:25:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495203 (10fgiunchedi) correction: I meant new hardware from one of the pools, i.e. any machine added in https://gerrit.wikimedia.org/r/#/c/290236/ so mw1291...
[10:38:18] (03PS1) 10Jcrespo: Update regex to include new labsdb and proxy machines [puppet] - 10https://gerrit.wikimedia.org/r/301095
[10:43:08] (03PS1) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282)
[10:43:31] !log restarting cassandra on aqs100[456] instances (not serving live traffic)
[10:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:43:41] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:45:22] (03PS4) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[10:46:10] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:46:37] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:51:48] (03Draft1) 10Addshore: beta wgEchoMentionStatusNotifications default true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301098 (https://phabricator.wikimedia.org/T140234)
[10:52:08] (03CR) 10Addshore: [C: 04-1] "Not to be merged until the code / depends-on patch is merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301098 (https://phabricator.wikimedia.org/T140234) (owner: 10Addshore)
[10:53:50] (03CR) 10Filippo Giunchedi: [C: 031] Update regex to include new labsdb and proxy machines [puppet] - 10https://gerrit.wikimedia.org/r/301095 (owner: 10Jcrespo)
[10:54:24] (03PS1) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:55:10] (03PS2) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:58:55] (03PS3) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:59:31] (03CR) 10Jcrespo: [C: 032] Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099 (owner: 10Jcrespo)
[10:59:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495260 (10Joe) @fgiunchedi I would recommend, as long as this is just experimental, to repurpose mw1152 and then mw1153-60 which are the old imagescalers an...
[11:01:34] (03PS1) 10Jcrespo: Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100
[11:01:48] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495263 (10elukey) @Nuria any objection to this access request?
[11:02:03] (03PS2) 10Jcrespo: Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100
[11:03:32] (03CR) 10Jcrespo: [C: 032] Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100 (owner: 10Jcrespo)
[11:06:03] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495265 (10Joe) To be more clear: hardware for the new appservers is not as much in overabundance as we had before; I would suggest not to repurpose machines...
[11:08:44] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495267 (10mark) >>! In T139606#2495265, @Joe wrote: > To be more clear: hardware for the new appservers is not as much in overabundance as we had before; I...
[11:18:37] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:18:37] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail
[11:20:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:21:57] (03PS1) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943)
[11:22:01] (03PS5) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606)
[11:22:14] (03CR) 10Addshore: [C: 04-1] "Not yet scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore)
[11:26:05] (03PS1) 10Filippo Giunchedi: claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606)
[11:26:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 19 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:28:42] !log filippo@palladium conftool action : set/pooled=no; selector: name=mw1291.*
[11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:29:34] !log filippo@palladium conftool action : set/pooled=no; selector: name=mw1292.*
[11:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:30:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:31:39] <_joe_> godog: use pooled=inactive when you decide to depool it completely
[11:32:59] _joe_: ok thanks, what's the difference?
[11:35:01] <_joe_> godog: inactive makes pybal remove the server from its config altogether
[11:39:20] !log filippo@palladium conftool action : set/pooled=inactive; selector: name=mw1292.eqiad.wmnet
[11:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:39:27] !log filippo@palladium conftool action : set/pooled=inactive; selector: name=mw1291.eqiad.wmnet
[11:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:42:22] (03PS2) 10Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337)
[11:43:01] (03CR) 10Gilles: [C: 04-1] "python-thumbor-wikimedia 1.0.6 needs to be packaged and uploaded to our repo first" [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles)
[11:43:09] _joe_: bit of a weird name then?
[11:43:18] inactive seems to suggest 'present but idling'
[11:47:00] (03PS1) 10Filippo Giunchedi: reclaim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606)
[11:47:27] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:48:13] !log installing exim4 updates related to perl security release
[11:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:58:56] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[12:00:22] hello event bus
[12:00:57] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[12:01:04] will schedule some downtime until ottomata is online. These are only test events and there might be something weird communicating with kafka 0.9
[12:01:07] (upgraded yesterday)
[12:04:07] ah maybe it is due to the new confluent kafka client pkg
[12:04:10] not sure
[12:15:46] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Puppet has 2 failures
[12:16:36] PROBLEM - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied
[12:17:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[12:17:59] 06Operations: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2495341 (10MoritzMuehlenhoff)
[12:18:23] moritzm: the icinga failure on mx2001 might be related to the upgrades?
[12:20:00] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495354 (10fgiunchedi) >>! In T139606#2495265, @Joe wrote: > I would suggest we pick 2 servers from the scalers pool, so mw1291 and 1292 - at the moment that...
[12:22:31] having a look
[12:23:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[12:26:39] thanks!
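For reference, the pooled states discussed between 11:31 and 11:43 as confctl invocations; the exact CLI spelling is reconstructed from the !log lines above rather than quoted from a terminal:

    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=yes       # pooled and serving
    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=no        # depooled, but pybal keeps the backend in its config
    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=inactive  # removed from pybal's config altogether
    confctl select 'name=mw1291.eqiad.wmnet' get                  # inspect the current value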
[12:26:47] RECOVERY - Disk space on mx2001 is OK: DISK OK [12:27:55] the exim4 update also introduced changes scheduled for the next jessie point release, so this needed a manual puppet run to reinstate our permissions for /var/spool/exim4/scan [12:31:35] 06Operations, 10hardware-requests: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2495365 (10Peachey88) [12:40:37] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:42:10] !log installing perl security updates [12:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:49:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 18 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:56:27] (03PS4) 10Sbisson: Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) [13:07:02] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2495413 (10BBlack) @Nuria - Thanks, sounds awesome :) [13:07:57] (03PS4) 10BBlack: Add Content-Security-Policy to images from test[2]wiki [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [13:08:22] (03CR) 10BBlack: [C: 032 V: 032] Add Content-Security-Policy to images from test[2]wiki [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [13:27:51] (03PS3) 10Elukey: Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) [13:28:06] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:28:47] (03CR) 10Elukey: "Reworked following Daniel's suggestion and the ones that came from the ops meeting. Kafka permissions will be added only if needed on a la" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [13:32:01] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495472 (10Joe) [13:33:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:34:36] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
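A sketch of the manual remediation described above, assuming the spool permissions are puppetized; the verification command is illustrative:

    # Re-apply the puppetized ownership/mode that the exim4 point-release update clobbered:
    sudo puppet agent --test
    # Confirm the Icinga disk check can traverse the directory again:
    sudo ls -ld /var/spool/exim4/scan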
[13:34:39] (03PS1) 10Jcrespo: Set es2001 check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 [13:39:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:40:40] (03PS2) 10Jcrespo: Set es2001 disk check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 [13:41:45] 06Operations, 10ops-eqiad: re-label mirror1001 to sodium - https://phabricator.wikimedia.org/T141105#2495493 (10Cmjohnson) 05Open>03Resolved [13:43:36] (03CR) 10Jcrespo: [C: 032] Set es2001 disk check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 (owner: 10Jcrespo) [13:46:16] RECOVERY - Disk space on es2001 is OK: DISK OK [13:46:30] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2480587 (10AlexMonk-WMF) >>! In T140911#2494702, @Dzahn wrote: > @Jonas You should now be able to login at grafana-admin using your wikitech credentials. It a... [13:46:49] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495542 (10Joe) I checked all active hosts, and all the differences I found are due basically due to bugs that have been fixed (like dep... [13:47:02] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2495545 (10Joe) [13:47:05] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495543 (10Joe) 05Open>03Resolved [13:50:27] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10AlexMonk-WMF) 'statistics-users' seems redundant if you've got 'researchers' [13:52:24] !next [13:52:31] jouncebot next [13:52:32] In 1 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500) [13:54:01] (03PS5) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [13:54:47] (03PS7) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [13:55:01] !log compressing 300GB table on dbstore2002 (expect warnings, slowdown, lag -but it is a passive analytics slave) [13:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] Bacula: Remove old gerrit backup path, unused now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [13:55:20] (03PS4) 10MarcoAurelio: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 [14:06:25] 06Operations, 10ops-eqiad: plug frqueue1001 into pfw1- ge-2/0/11 - https://phabricator.wikimedia.org/T141361#2495599 (10Jgreen) [14:07:55] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - 
https://phabricator.wikimedia.org/T141339#2495617 (10elukey) >>! In T141339#2495557, @AlexMonk-WMF wrote: > 'statistics-users' seems redundant if you've got 'researchers' I am not super familiar with the exact diff... [14:10:34] (03PS2) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) [14:12:12] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2495639 (10Jgreen) [14:13:25] (03PS3) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) [14:14:34] (03CR) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [14:16:29] (03CR) 10ArielGlenn: [C: 032] move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) (owner: 10ArielGlenn) [14:27:42] (03PS1) 10ArielGlenn: fix name of config file for addschanges dump on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301124 [14:29:57] (03PS2) 10Filippo Giunchedi: claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606) [14:30:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:32:30] !log reimage mw1291 as thumbor1001 [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:16] (03CR) 10ArielGlenn: [C: 032] fix name of config file for addschanges dump on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301124 (owner: 10ArielGlenn) [14:35:08] (03PS6) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [14:36:36] (03PS2) 10Filippo Giunchedi: claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) [14:36:50] !log uploading openjdk-8 security update (8u102-b14-1~bpo8+1) to carbon [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:53] (03CR) 10Filippo Giunchedi: [C: 032] claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:38:02] (03PS3) 10Filippo Giunchedi: claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) [14:38:07] (03CR) 10Filippo Giunchedi: [V: 032] claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:39:16] (03PS1) 10ArielGlenn: fix up template location for addschanges config file [puppet] - 10https://gerrit.wikimedia.org/r/301125 [14:41:31] (03PS2) 10ArielGlenn: fix up template location for addschanges config file [puppet] - 10https://gerrit.wikimedia.org/r/301125 [14:42:43] (03PS1) 10Ottomata: Hieraize eventlogging_kafka_handler to allow selection of different kafka clients [puppet] - 10https://gerrit.wikimedia.org/r/301126 [14:42:48] (03CR) 10ArielGlenn: [C: 032] fix up template location for addschanges config file [puppet] - 
10https://gerrit.wikimedia.org/r/301125 (owner: 10ArielGlenn) [14:44:15] (03PS2) 10Thcipriani: Use hiera for udp2log-mw logrotate count [puppet] - 10https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) [14:45:26] (03PS1) 10ArielGlenn: remove dump related classes from snapshot1001 through 1004 [puppet] - 10https://gerrit.wikimedia.org/r/301127 [14:46:39] (03CR) 10ArielGlenn: [C: 032] remove dump related classes from snapshot1001 through 1004 [puppet] - 10https://gerrit.wikimedia.org/r/301127 (owner: 10ArielGlenn) [14:48:36] (03PS1) 10Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 [14:48:38] (03PS1) 10Jforrester: Change default gallery mode to 'packed' on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) [14:48:47] (03CR) 10Thcipriani: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3475/" [puppet] - 10https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) (owner: 10Thcipriani) [14:49:06] (03CR) 10jenkins-bot: [V: 04-1] Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [14:49:14] (03CR) 10jenkins-bot: [V: 04-1] Change default gallery mode to 'packed' on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [14:50:25] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2495768 (10Jseddon) @akosiaris Thank you so much for that link. To be honest you really don't have to try and recover the entire pad. That serv... [14:50:43] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2495770 (10Jseddon) p:05High>03Lowest [14:51:36] (03PS2) 10Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 [14:55:18] (03PS1) 10ArielGlenn: remove snapshot1001 through 1004 from hiera, mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/301131 [14:58:57] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2495785 (10GWicke) Based on the limited data so far, read latency seems pretty much unaffected on either host: {F43... [14:59:16] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100% [14:59:43] is that the false positive? [14:59:49] it's not exactly a false positive [14:59:52] but yeah, looking into it [14:59:53] because network is dumb? [14:59:59] swat time! [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500). Please do the needful. [15:00:04] stephanebisson, mafk, and Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:12] !log installed cr2-eqiad FPC 3 [15:00:16] * mafk present [15:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:23] hello [15:00:44] Who's swatting? [15:00:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2495788 (10Dzahn) AlexMonk is right, i said that because of this line " 31 # Require ldap-group cn=nda,ou=groups,dc=wikimedia,dc=org " but did not not... [15:00:55] tyler as usual? :) [15:01:09] I can SWAT today [15:01:49] (03CR) 10Thcipriani: [C: 032] Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) (owner: 10Sbisson) [15:02:15] (03Merged) 10jenkins-bot: Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) (owner: 10Sbisson) [15:02:17] * addshore is here :) [15:02:54] (03CR) 10ArielGlenn: [C: 032] remove snapshot1001 through 1004 from hiera, mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/301131 (owner: 10ArielGlenn) [15:04:08] (03PS1) 10ArielGlenn: remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 [15:05:10] (03Abandoned) 10BryanDavis: Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [15:05:25] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [15:06:13] stephanebisson: change is live on mw1099, check with X-Wikimedia-Debug if applicable please [15:06:44] it should be a no-op but I'll test related functionality quickly [15:06:51] ack, thanks [15:07:51] (03CR) 10Elukey: "Looks good! I added some comments but I am not sure if they are relevant.. 
The zookeeper_url was the only puzzling part since afaiu it sho" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [15:08:35] PROBLEM - ElasticSearch health check for shards on elastic1032 is CRITICAL: CRITICAL - elasticsearch inactive shards 1381 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1303, number_of_pending_tasks: 1533, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 231576, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:35] PROBLEM - ElasticSearch health check for shards on elastic1036 is CRITICAL: CRITICAL - elasticsearch inactive shards 1381 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1303, number_of_pending_tasks: 1534, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 231656, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234524, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1034 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234573, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1035 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234488, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:42] (03PS6) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [15:08:46] PROBLEM - ElasticSearch health check for shards on elastic1043 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244440, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:46] PROBLEM - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244519, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:46] PROBLEM - 
ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244582, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:47] PROBLEM - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 1357 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1279, number_of_pending_tasks: 1598, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 246874, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:56] PROBLEM - ElasticSearch health check for shards on elastic1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 1349 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1271, number_of_pending_tasks: 1625, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 252274, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1038 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1669, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262536, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1047 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1670, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262600, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1670, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262793, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] looking ^ [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1033 is CRITICAL: CRITICAL - elasticsearch inactive shards 1341 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1263, number_of_pending_tasks: 1680, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 266031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:07] PROBLEM - ElasticSearch health check for shards on elastic1037 is CRITICAL: CRITICAL - elasticsearch inactive shards 1341 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1263, number_of_pending_tasks: 1680, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 266280, cluster_name: 
production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:13] thcipriani: all good [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1023 is CRITICAL: CRITICAL - elasticsearch inactive shards 1340 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1262, number_of_pending_tasks: 1683, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 267184, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1045 is CRITICAL: CRITICAL - elasticsearch inactive shards 1338 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1260, number_of_pending_tasks: 1690, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 270412, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 1338 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1260, number_of_pending_tasks: 1690, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 270432, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:18] o.O [15:09:25] PROBLEM - ElasticSearch health check for shards on elastic1044 is CRITICAL: CRITICAL - elasticsearch inactive shards 1333 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1255, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1296, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_a [15:09:26] PROBLEM - ElasticSearch health check for shards on elastic1039 is CRITICAL: CRITICAL - elasticsearch inactive shards 1330 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1252, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4301, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_a [15:09:36] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 1326 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1248, number_of_pending_tasks: 63, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 13253, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_ [15:09:39] <_joe_> gehel: any idea what happened? 
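When a storm like the above hits, a first diagnostic stop is the cluster health API on any node; a minimal sketch, assuming direct HTTP access on Elasticsearch's default port 9200 (hostname illustrative):

    # Cluster-wide status (green/yellow/red), unassigned shard count, pending tasks:
    curl -s 'http://elastic1030:9200/_cluster/health?pretty'
    # Shard recoveries currently in flight while the cluster re-replicates:
    curl -s 'http://elastic1030:9200/_cat/recovery?active_only=true&v'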
[15:09:45] PROBLEM - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch inactive shards 1321 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1243, number_of_pending_tasks: 105, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 20430, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:09:45] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 1321 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1243, number_of_pending_tasks: 105, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 20532, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:09:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 32, dormant: 0, excluded: 1, unused: 0BRxe-3/3/1: down - BRxe-3/1/4: down - BRxe-3/0/1: down - BRxe-3/2/3: down - BRxe-3/3/3: down - BRxe-3/3/5: down - BRxe-3/0/3: down - BRxe-3/0/5: down - BRxe-3/2/1: down - BRxe-3/1/7: down - BRxe-3/3/6: down - BRxe-3/1/3: down - BRxe-3/1/2: down - BRxe-3/1/5: down - BRxe-3/1/0: down - BRxe [15:10:05] PROBLEM - ElasticSearch health check for shards on elastic1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 1307 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1229, number_of_pending_tasks: 160, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 37384, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:05] PROBLEM - ElasticSearch health check for shards on elastic1041 is CRITICAL: CRITICAL - elasticsearch inactive shards 1307 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1229, number_of_pending_tasks: 160, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 37626, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 1300 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1222, number_of_pending_tasks: 173, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 44521, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] PROBLEM - ElasticSearch health check for shards on elastic1046 is CRITICAL: CRITICAL - elasticsearch inactive shards 1300 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1222, number_of_pending_tasks: 173, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 44562, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] _joe_ he is out [15:10:16] <_joe_> oh, well [15:10:16] but there is a PROBLEM - Router interfaces on cr2-eqiad [15:10:16] PROBLEM - ElasticSearch health check for shards on elastic1040 is CRITICAL: CRITICAL - elasticsearch inactive shards 1291 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1213, number_of_pending_tasks: 190, number_of_in_flight_fetch: 0, 
timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 54045, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:25] PROBLEM - ElasticSearch health check for shards on elastic1042 is CRITICAL: CRITICAL - elasticsearch inactive shards 1284 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1206, number_of_pending_tasks: 204, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 60337, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:26] PROBLEM - ElasticSearch health check for shards on elastic1024 is CRITICAL: CRITICAL - elasticsearch inactive shards 1281 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1203, number_of_pending_tasks: 225, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 63933, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:26] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 1281 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1203, number_of_pending_tasks: 225, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 64000, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:30] _joe_: still unclear but network issue I suppose [15:10:44] (03PS7) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [15:11:03] no [15:11:06] no network issues [15:11:08] ostriches, hi would you be able to review https://gerrit.wikimedia.org/r/#/c/301001/ please. [15:11:30] some nodes stopped talking to each other [15:11:33] <_joe_> dcausse: let's switch to codfw and debug with ease? [15:11:58] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:289395|Remove EchoBundleEmailInterval (T135446)]] PART I (duration: 00m 34s) [15:11:59] thcipriani: don't know if you've seen my response with all those alerts, all good with the config patch [15:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:09] T135446: Unsubstituted message footer in mail notification of thanks - https://phabricator.wikimedia.org/T135446 [15:12:19] <_joe_> thcipriani: please stop swatting if you see things exploding in here [15:12:27] (03PS2) 10Chad: Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 [15:12:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:289395|Remove EchoBundleEmailInterval (T135446)]] PART II (duration: 00m 26s) [15:12:32] T135446: Unsubstituted message footer in mail notification of thanks - https://phabricator.wikimedia.org/T135446 [15:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:42] ^ stephanebisson deployed everywhere. [15:12:43] _joe_: don't know...
search is still working, let me have a quick look at how many shards were unassigned [15:12:45] _joe_: ack, stopping [15:12:52] <_joe_> ok [15:13:08] (03PS2) 10ArielGlenn: remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 [15:13:12] <_joe_> thcipriani: rationale is: a) we don't want to overlap effects b) we might need to do a very fast sync-file [15:14:16] _joe_: yup, makes sense. change had been pushed to test and pulled down to deployment master, wanted to make sure we were in a consistent state before pausing. [15:14:17] <_joe_> dcausse: have you tried restarting one of the segregated nodes? [15:14:28] <_joe_> thcipriani: yeah agreed! [15:15:59] <_joe_> I see the shards are recovering [15:16:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 1, unused: 0 [15:16:10] _joe_: they are all back [15:16:15] <_joe_> that's because the partition wasn't as bad as it could've been [15:16:16] recovering [15:16:25] (this alert was spurious and it's recovering because I just told it to ignore those interfaces, fwiw) [15:16:32] trying to understand why the master dropped 6 nodes suddenly [15:16:44] <_joe_> dcausse: what node is the master? [15:16:51] we had a network flap, but it must have lasted for less than 3-4 seconds [15:17:08] _joe_: elastic1030 [15:17:13] I can do it again if you want [15:17:15] <_joe_> paravoid: I guess all servers in a row were dropped? [15:17:16] just a master swap [15:17:19] <_joe_> paravoid: not now :) [15:17:27] <_joe_> btw let me check something [15:17:30] paravoid: that's not really a flap is it [15:17:57] <_joe_> restbase didn't crash [15:18:07] (03CR) 10ArielGlenn: [C: 032] remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 (owner: 10ArielGlenn) [15:18:32] it was really really short [15:18:44] nothing else even noticed really [15:18:48] paravoid: that might be the cause, nodes were removed and re-added in one sec, unfortunately it was sufficient to cause a recovery [15:19:13] paravoid: why would that cause unavailability though? [15:19:19] not even 500s noticed, which is one of our most sensitive checks [15:19:28] during a vrrp master switch the backup doesn't go away [15:19:39] <_joe_> dcausse: [2016-07-26 15:05:19,293][WARN ][cluster.routing.allocation.decider] [elastic1030] after allocating, node [7mpr8WTsQIC_6Gvpza0Emg] would have more than the allowed 20% free disk threshold (16.6% free), preventi [15:19:43] <_joe_> ng allocation [15:19:48] <_joe_> is this ok/normal? [15:20:01] (03PS1) 10Muehlenhoff: Remove old trusty scalers from conftool-data and dsh [puppet] - 10https://gerrit.wikimedia.org/r/301138 (https://phabricator.wikimedia.org/T141352) [15:20:06] _joe_: it's "normal" yes [15:21:12] I think it's ok to let the cluster recover [15:21:50] <_joe_> dcausse: I agree I was looking at a few logs [15:21:59] yeah, and then do this again a few times :) [15:21:59] <_joe_> and well, you're more expert than me with elasticsearch [15:22:18] <_joe_> I am sure there is some parameter to tune [15:23:56] dropping a node will always cause a decrease in the number of shards and cause icinga alerts, unfortunately the time to recover is not negligible... [15:23:59] Hello! Just saw the message... Can I do anything to help?
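For reference, the "parameter to tune" here plausibly covers two knobs: the delayed-allocation window, so a node that drops for a few seconds does not trigger a full shard recovery, and the free-disk watermark behind the allocation-decider warning quoted above. A hedged sketch; setting names per the Elasticsearch 1.x/2.x documentation, values purely illustrative:

    # Wait before reallocating shards that belonged to a node that just left:
    curl -XPUT 'http://localhost:9200/_all/_settings' -d '{"index.unassigned.node_left.delayed_timeout": "5m"}'
    # The free-disk threshold the decider enforces (80% used = the "20% free" in the warning):
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.disk.watermark.high": "80%"}}'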
[15:24:00] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2495881 (10faidon) We installed the new FPC on cr2-eqiad today — it's now up and online, all of its 32 10G ports. [15:24:01] _joe_: still need the '@'? ;) [15:24:10] gehel: hi! [15:24:20] <_joe_> Luke081515: nope it just remained attached from kicking the troll yesterday [15:24:44] ok :) [15:25:05] thcipriani: if you could please poke me after swat restarts it'd be appreciated, thank you! [15:25:05] <_joe_> not exactly my main priority (kicking trolls out of IRC chans) [15:25:18] mafk: yup, will do. [15:25:20] (03PS1) 10ArielGlenn: move dump related commonly included classes out to common role [puppet] - 10https://gerrit.wikimedia.org/r/301139 [15:25:20] Luke081515: does it matter so much? [15:25:32] thcipriani: same for me please :) [15:25:33] dcausse: do you have an approximate time of when the troubles started? [15:28:25] <_joe_> paravoid: it's in general good irc-netiquette not to keep the op tag when not needed - I agree with that [15:28:34] (03CR) 10ArielGlenn: [C: 032] move dump related commonly included classes out to common role [puppet] - 10https://gerrit.wikimedia.org/r/301139 (owner: 10ArielGlenn) [15:28:39] yeah sure, but whatever [15:28:43] <_joe_> I just didn't notice until Luke081515 told me [15:28:56] <_joe_> oh, yes, that was basically my preceding comment too [15:31:18] (03PS1) 10ArielGlenn: add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 [15:33:08] (03CR) 10BryanDavis: [C: 031] "If at some point we turn on the ferm firewalls in deployment-prep we won't want to restrict the Logstash ports to the deployment-prep proj" [puppet] - 10https://gerrit.wikimedia.org/r/297376 (owner: 10Muehlenhoff) [15:33:09] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2495639 (10Cmjohnson) I did a check on all ports and verified each one. pfw1 0 -> indium 1 -> payment1001 2 -> payment1003 3 -> pay-lvs1001 4 -> pay-lvs1001 eth2 (doesn’t appear to be acti... [15:33:14] lol +1 to whatever on ircops. 
[15:33:29] (03PS2) 10ArielGlenn: add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 [15:34:46] RECOVERY - ElasticSearch health check for shards on elastic1041 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 835, number_of_pending_tasks: 29, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4856, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0141720266, [15:34:46] RECOVERY - ElasticSearch health check for shards on elastic1043 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 835, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0141720266, acti [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1025 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 25, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 3687, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1034 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 26, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 3865, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1035 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 27, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4046, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:35:06] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 832, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0468767034, acti [15:35:15] RECOVERY - ElasticSearch health check for shards on elastic1047 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, acti [15:35:15] RECOVERY - ElasticSearch health check for shards on elastic1028 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: 
False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, acti [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 273, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1020 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 306, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1044 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 462, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:27] RECOVERY - ElasticSearch health check for shards on elastic1019 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 827, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.101384498, activ [15:35:36] RECOVERY - ElasticSearch health check for shards on elastic1046 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 827, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.101384498, activ [15:35:41] the number of shards is cluster wide, maybe we don't need to have this check on all nodes... 
[15:35:55] RECOVERY - ElasticSearch health check for shards on elastic1022 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, acti [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1040 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1074, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1033 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1769, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1813, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1045 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1030 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1023 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, 
active_shards_percent_as_number: 90.1231876158, acti [15:36:06] RECOVERY - ElasticSearch health check for shards on elastic1037 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 23, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4864, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:07] RECOVERY - ElasticSearch health check for shards on elastic1039 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 22, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4781, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:07] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 22, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4751, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:15] RECOVERY - ElasticSearch health check for shards on elastic1042 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 821, number_of_pending_tasks: 38, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 10840, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1667938515, [15:36:16] RECOVERY - ElasticSearch health check for shards on elastic1038 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 818, number_of_pending_tasks: 50, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 14580, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1994985283, [15:36:22] (03CR) 10Eevans: [C: 031] "Making a Cassandra tunable, tunable via Puppet, seems reasonable to me, (so +1)." 
[puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869) (owner: 10Elukey) [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 76, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 27008, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.2976125586, [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1024 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 76, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 27002, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.2976125586, [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1027 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 806, number_of_pending_tasks: 90, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 32023, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3303172354, [15:36:46] RECOVERY - ElasticSearch health check for shards on elastic1032 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3630219121, acti [15:36:46] RECOVERY - ElasticSearch health check for shards on elastic1036 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3630219121, acti [15:37:23] (03CR) 10ArielGlenn: [C: 032] add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 (owner: 10ArielGlenn) [15:37:56] dcausse: Maybe just run it on the master? [15:38:55] Er, master_eligible. That'd be 8 nodes :) [15:39:16] lots of recovery. Can SWAT continue? [15:39:44] dcausse: Could also help detect a split brain, if one master_eligible (but not the others) reported shards missing. [15:40:09] * ostriches shrugs [15:42:01] (03PS1) 10ArielGlenn: remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 [15:42:34] ostriches: indeed [15:43:16] * thcipriani continues SWAT [15:43:31] [= [15:43:52] mafk: if you're around, let's try to get yours out the door. 
[15:44:06] thcipriani: yep, I'm done with the other things [15:44:19] (03CR) 10Thcipriani: [C: 032] Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:44:33] * mafk enables x-wm-debug [15:44:37] dcausse: Plus, 8 failure/recoveries are nicer than like 30 :p [15:44:49] ircspam-- :p [15:45:33] (03PS6) 10Thcipriani: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:46:37] (03CR) 10Thcipriani: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:46:42] (03CR) 10Thcipriani: [C: 032] Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:47:07] !log reimage mw1292 as thumbor1002 [15:47:10] (03Merged) 10jenkins-bot: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:21] ostriches: well, at least I can't miss it :), I'll have a look at this check, err. I'll poke Guillaume about that :) [15:48:32] akosiaris: When you get a minute, I amended that bacula patch. Just trying to clean up all the old crud from ytterbium :) [15:48:47] dcausse: Sounds good. If you need a reviewer feel free to throw me on it. [15:48:53] thanks! [15:49:06] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: Connection refused [15:49:18] godog, so are we going to need a thumbor machine in deployment-prep? [15:49:42] (03PS1) 10Thcipriani: Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 [15:50:00] (03CR) 10Alexandros Kosiaris: [C: 032] Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:08] (03PS3) 10Alexandros Kosiaris: Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:09] mafk: https://gerrit.wikimedia.org/r/#/c/300177/ should be live on mw1099, check please [15:50:11] (03CR) 10Alexandros Kosiaris: [V: 032] Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:20] thcipriani: ack, checking [15:50:31] Krenair: mh I don't think so, I've been using deployment-imagescaler01 which was otherwise idle [15:50:39] ostriches: done :-) [15:50:43] Krenair: we could rename it tho, I'm not attached to the name [15:51:00] godog, I don't mind the name. I actually wasn't aware we had that machine [15:51:09] thcipriani: after changing a namespace name, shouldn't namespaceDupes be run? (ping Krenair) [15:51:10] ostriches: thanks for cleaning up btw [15:51:18] No problem! [15:51:22] Like I said in the comments, this is the exact same data we're backing up from lead now, so if there's copies of ytterbium's data, drop away :) [15:51:24] mafk, hmm... can you link me to the change?
https://gerrit.wikimedia.org/r/#/c/300177/ [15:51:35] I think the answer is yes [15:51:50] thcipriani: all looks ok on mw1099 [15:51:57] dcausse: we do expose all services through lvs, so we could check cluster wide stats there. [15:52:06] mafk: ack, rolling out everywhere. [15:52:10] I don't see the new favicon though, it might be a cache thing [15:52:12] dcausse: and not on a specific node [15:52:39] gehel: sounds like a good idea, I think it'll work [15:53:09] I don't like that idea, because if lvs is down (which is bad on its own), it causes cascading (and technically incorrect) failures as well. [15:53:10] mafk, this doesn't change namespaces? [15:53:20] oh I just hadn't scrolled down [15:53:22] I prefer to check directly from a node (which is why I suggested a master) [15:53:22] damn gerrit ui change [15:53:22] Krenair: renames Wiktionary to translated [15:53:27] lol [15:53:47] when we get used to it, we'll have to get used to Phabricator differential [15:54:00] Nah, we'll probably get the *new* gerrit UI first :p [15:54:02] checking master eligible nodes is nice also, if all are down search is down anyways [15:54:03] mafk: yup :P [15:54:07] (yes, I'm not joking :)) [15:54:07] :D [15:54:10] !log thcipriani@tin Synchronized static/favicon/wiktionary/mk.ico: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART I (duration: 00m 24s) [15:54:45] certainly when creating or deleting a namespace you'd run namespaceDupes [15:54:53] !log thcipriani@tin Synchronized static/images/project-logos/mkwiktionary.png: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART II (duration: 00m 24s) [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:58] I can't do that I think [15:54:59] I'm not sure it matters here. maybe run it without --fix to see [15:55:21] yup, will give it a shot. [15:55:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART III (duration: 00m 26s) [15:55:43] dcausse: we probably want a specific check for master eligible... [15:56:16] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: puppet fail [15:56:16] yes, and maybe move the shard count check to this specific check [15:56:25] ostriches i thought that new ui is part of gerrit 3.0. [15:56:29] * gehel is going back to vacation. Will have a look later tonight... [15:56:46] yup: 632 links to fix [15:57:04] paladox: Lol, gerrit 3.0 is nowhere near fruition. Polygerrit is probably gonna land in another release or two. [15:57:11] Oh [15:57:17] It has already landed [15:57:26] in gerrit master waiting for release [15:57:42] mafk: check please [15:57:52] but it is not the default and will require either us or them to build the war with polygerrit [15:58:05] paladox: Yeah in master, but I dunno if it'll make it into 2.13. There's a bunch of outstanding bugs and missing stuff. [15:58:09] polygerrit is broken on Internet Explorer, but works on microsoft edge [15:58:12] dare I ask what polygerrit is? [15:58:18] apergos: New UI. [15:58:19] It is the new ui [15:58:22] For gerrit [15:58:27] thcipriani: ack, rechecking, has namespaceDupes run already?
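A sketch of the check placement being discussed: since shard counts are cluster-wide, one option is to list the master-eligible nodes and attach the Icinga shard check only to those; column semantics per the _cat/nodes API, hostname illustrative:

    # 'm' in the master column marks a master-eligible node, '*' the elected master:
    curl -s 'http://elastic1030:9200/_cat/nodes?v&h=name,master'
    # The shard-count service check would then target those ~8 hosts instead of all 31 nodes.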
[15:58:38] apergos: https://gerrit-review.googlesource.com/c/79298/?polygerrit=1 (compare with =0 if you need a reminder of what it looks like now) [15:58:40] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495974 (10AlexMonk-WMF) >>! In T141339#2495617, @elukey wrote: >>>! In T141339#2495557, @AlexMonk-WMF wrote: >> 'statistics-users' seems redundant if you've got 'researcher... [15:58:57] mafk: yup, just run [15:59:00] ostriches it is definitely a much better ui [15:59:06] my immediate reaction is that I prefer the poly [15:59:08] but still needs a lot of improvements [15:59:14] I find the one we have now too cluttered [15:59:19] Yep [15:59:29] thcipriani: hmm https://mk.wiktionary.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D1%80%D0%B5%D1%87%D0%BD%D0%B8%D0%BA:%D0%91%D0%BE%D1%82%D0%BE%D0%B2%D0%B8 [15:59:30] apergos: It's not feature complete unfortunately. Only in master and hidden behind config flags / url params. [15:59:33] (03PS2) 10Giuseppe Lavagetto: Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 (owner: 10Thcipriani) [15:59:44] I'm hoping they finish before 2.13, but we'll see :) [15:59:45] bah hermberg [16:00:00] I used the new 'edit files' feature just today [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1600). Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495976 (10elukey) I will follow up adding some clarity to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups [16:00:21] to edit the commit message. it probably would have been quicker to pop over to my editor and save/push but eh [16:00:28] just because I *could* [16:00:45] mafk: hmm, ran mwscript namespaceDupes.php mkwiktionary --fix, didn't seem to be any problems :\ [16:01:03] apergos: My favorite use-case tbh is actually "I see someone else's random patch that has a small typo" [16:01:04] (03CR) 10Giuseppe Lavagetto: [C: 032] Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 (owner: 10Thcipriani) [16:01:16] Rather than actually fixing your own patches (which you likely have sitting on a branch locally anyway) [16:01:29] mafk: just did a dry run again: 0 pages to fix, 0 were resolvable. [16:01:30] this was 'woops I'ma gonna merge this so clearly it's no longer WIP' [16:01:34] thcipriani: maybe it still needs a bit of time, anyway, all looks good (new favicon not showing though, maybe also needs a bit of time) [16:01:50] but yeah if it's someone else's patch that is the perfect [16:01:55] https://github.com/gerrit-review/gerrit/tree/master/polygerrit-ui [16:01:56] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.047 second response time [16:02:01] apergos ostriches ^^ [16:02:11] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495987 (10Addshore) >>! In T141339#2495976, @elukey wrote: > I will follow up adding some clarity to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups...
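The namespaceDupes workflow thcipriani runs above, spelled out as a sketch — the mkwiktionary wiki ID and the mwscript wrapper are taken straight from the log; running without --fix first is the dry run mentioned earlier:

    # without --fix the script only reports titles colliding with the renamed namespace
    mwscript namespaceDupes.php mkwiktionary
    # resolve whatever the dry run found
    mwscript namespaceDupes.php mkwiktionary --fix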
[16:02:15] https://mk.wiktionary.org/wiki/%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D1%98%D0%B0%D0%BB%D0%BD%D0%B0:%D0%9F%D1%80%D0%B8%D0%B4%D0%BE%D0%BD%D0%B5%D1%81%D0%B8/MarcoAurelio <-- all looks good here, the namespace names were changed correctly [16:02:19] mafk: ack. ok. mafk addshore : I'm going to bump the rest of the patches for SWAT so I can get out of the way for puppet SWAT. [16:02:31] thcipriani: okay! [16:02:32] still here :) [16:02:37] paladox: bookmarked to look at later [16:02:39] thankye [16:02:44] Ok and you're welcome :) [16:02:47] Yeah we'll have to install node on the gerrit server! I feel all kinds of weird about that :p [16:03:06] java and node together! Just like...no one intended? ;-) [16:03:07] next up phpMyAdmin :-P [16:03:10] Oh, maybe node won't be required. [16:03:21] ostriches node isn't required, you can either use go, or node [16:03:29] apergos: You say that like we've got php installed on the gerrit server :p [16:03:41] paladox: That's not what it says. [16:03:46] It says install node, and go is optional. [16:03:46] I believe in planning for the worst case :-P [16:03:52] Oh [16:04:24] Or you can do [16:04:25] buck build polygerrit && \ [16:04:25] java -jar buck-out/gen/polygerrit/polygerrit.war daemon --polygerrit-dev -d ../gerrit_testsite --console-log --show-stack- [16:04:28] ostriches ^^ [16:04:46] build it with buck, and set up a test instance where you can test it. [16:04:48] ? [16:04:50] (03CR) 10Alexandros Kosiaris: "That would be https://gerrit.wikimedia.org/r/#/c/301071/" [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [16:05:14] Heh, I don't need to test it right now, far too immature to spend time on :) [16:05:18] * ostriches will just wait and watch instead [16:05:32] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495993 (10AlexMonk-WMF) >>! In T141339#2495987, @Addshore wrote: >>>! In T141339#2495976, @elukey wrote: >> I will follow up adding some clarity to https://wikitech.wikimed... [16:06:50] (03CR) 10Alexandros Kosiaris: [C: 031] "@Faidon. it's the standard namespace issue we are facing with role::* as long as we import role/*. After getting rid of import role/* we w" [puppet] - 10https://gerrit.wikimedia.org/r/298911 (owner: 10Dzahn) [16:07:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495994 (10Addshore) >>! In T141339#2495993, @AlexMonk-WMF wrote: > Yes. You can find this out in puppet (manifests/site.pp shows it has `role statistics::cruncher`, hierada... [16:08:22] (03CR) 10Faidon Liambotis: "Are these lint warnings like Daniel said or actual autoloader issues?
(if it's the former, let's ignore them for now?)" [puppet] - 10https://gerrit.wikimedia.org/r/298911 (owner: 10Dzahn) [16:08:57] (03CR) 10Alexandros Kosiaris: [C: 031] ipmi: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298902 (owner: 10Dzahn) [16:09:41] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:09:47] (03PS4) 10Alexandros Kosiaris: servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:09:55] ostriches im hoping they promote it to stable and remove the flag and just allow you to choose the default in the preference [16:10:24] im also hoping they drop the cookie that they use to get it working [16:11:03] is there somewhere in eqiad that I could stash ~900G of data? [16:11:29] * bd808 hands urandom a very large thumb drive [16:12:18] bd808: heh [16:12:50] (03CR) 10Alexandros Kosiaris: [V: 032] servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:13:54] urandom: fluorine is the biggest pool of disk I know about in eqiad but it only has 500G free (of 3.8T) [16:14:33] thanks for the swat thcipriani [16:14:44] will schedule the other two for tomorrow if I can [16:15:13] mafk: ack thanks for the patches, sorry the window ran out :( [16:15:15] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2496019 (10elukey) Confirmed with ottomata that 'researchers' is the only one needed. [16:15:22] urandom: yeah for graphite data on labmon it has been done with an external drive iirc [16:15:50] thcipriani: not your fault, blame that ElasticSearch thing :P [16:15:51] oh damn, the thumb drive wasn't a joke... [16:16:25] 06Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496020 (10ema) [16:16:40] 06Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496032 (10ema) p:05Triage>03Normal [16:17:00] urandom, bd808: If only we still had NFS! ;-) [16:17:19] poor old netapps [16:20:12] has tcy.wiki been created meanwhile? [16:21:04] RECOVERY - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is OK: TCP OK - 0.038 second response time on port 9042 [16:23:34] mutante: no, I don't think so - however tcy.wikipedia.org redirects to the test project in Incubator [16:23:51] i think i know where i could put it (at least temporarily), but it would involve transfering it to codfw; would it be a problem to rsync 900G of data from eqiad to codfw? [16:24:05] mafk: yes, i added it to DNS but wanted to know about the actual create_wiki script [16:24:12] mafk: looks like it. yea.. 
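As an aside, the Age-header regression filed above (T141373) is easy to observe from the outside — a sketch; the URL is illustrative and any cacheable page would do:

    # repeat a few times: Age should keep growing for a cached object;
    # a reset to 0 well before TTL expiry is the reported symptom
    curl -sI 'https://en.wikipedia.org/wiki/Main_Page' | grep -i '^age:'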
[16:24:20] ok [16:25:57] we need to get the mw-config change merged first [16:26:27] not sure if swat-able [16:26:46] urandom: no, unlikely it is a problem [16:30:31] (03PS1) 10Alex Monk: Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) [16:30:38] 06Operations, 13Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#2496057 (10AlexMonk-WMF) a:03AlexMonk-WMF [16:32:56] (03PS2) 10ArielGlenn: remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 [16:33:52] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:33:53] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:34:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:34:17] (03CR) 10ArielGlenn: [C: 032] remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 (owner: 10ArielGlenn) [16:36:02] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:36:03] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [16:36:09] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2496062 (10GWicke) After deploying the changes mentioned in T136957#2485532 yesterday, it looks like today's network issue did not result in any RB m... [16:36:12] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:41:11] (03CR) 10Dzahn: [C: 031] "makes sense to use upstream tool. seems to work for me on terbium." [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [16:41:57] (03PS1) 10ArielGlenn: clean up dumps dirs manifest for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301150 [16:42:09] mutante <3 also https://gerrit.wikimedia.org/r/#/c/301059/1? [16:42:31] mutante I also wrote some docs https://wikitech.wikimedia.org/wiki/LDAP#Common_LDAP_administrative_actions [16:44:42] YuviPanda: the one for groups .. i tried that just now [16:44:45] ldapvi -b ou=groups uid="$GROUP" [16:44:53] cn=$GROUP [16:44:55] not uid= [16:45:04] yea, but that is pasted from your change [16:45:07] i was about to ask [16:45:07] I should fix [16:45:11] right [16:46:21] YuviPanda: very useful to check the group and +100 on the comments about removing old access [16:46:36] i'll get some food and be back [16:46:57] and yep, makes sense to use the upstream code [16:47:11] cool [16:48:05] the only drawback if there is one at all.. 
it seems a bit easier to make a bad mistake [16:48:14] (03PS2) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) [16:48:15] mutante updated [16:48:16] (03PS3) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) [16:48:18] (03PS7) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) [16:48:20] (03PS2) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [16:48:22] (03PS2) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [16:49:40] mutante yeah, but you can review your changes with ldapvi before committing them [16:50:05] (03CR) 10Elukey: "Ran another time pcc and same result: https://puppet-compiler.wmflabs.org/3483/" [puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869) (owner: 10Elukey) [16:50:18] YuviPanda: one more comment. in the new PS now one script is in /usr/local/bin and the other in /usr/local/sbin [16:50:32] right, I'll move them to /usr/local/bin [16:50:36] YuviPanda: good point, yep [16:50:45] (03CR) 10ArielGlenn: [C: 032] clean up dumps dirs manifest for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301150 (owner: 10ArielGlenn) [16:51:46] mutante updated [16:51:58] (03PS3) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [16:52:00] (03PS3) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [16:53:02] (03CR) 10Dzahn: [C: 031] ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [16:53:08] hey, this should not be happening -> https://commons.wikimedia.org/w/index.php?title=User_talk:DerHexer&action=rollback&from=&token=5345d636ef4ede27d1c74a063f57618757979546%2B%5C [16:53:09] lgtm [16:53:15] (03PS1) 10ArielGlenn: move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1700). Please do the needful. [17:00:34] will deploy parsoid in a little bit. 
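As corrected earlier in this exchange, group entries are keyed by cn, not uid. A minimal lookup sketch — the ou=groups base is from the log, while the full base DN and server come from the local ldapvi setup and are assumptions here:

    GROUP=project-example    # hypothetical group name
    ldapvi -b ou=groups "cn=$GROUP"
    # ldapvi opens the matching entries in your editor and prompts before
    # writing anything back, which is the review-before-commit step above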
[17:03:32] !log starting parsoid deploy [17:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:51] (03PS2) 10ArielGlenn: move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 [17:05:44] things I love about the new gerrit: it's smart enough to figure out when you've just done a rebase or a commit message update via git on command line [17:05:48] used to not be so [17:06:11] !log synced new parsoid code; restarted parsoid on wtp1007 as a canary [17:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:20] (03CR) 10ArielGlenn: [C: 032] move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 (owner: 10ArielGlenn) [17:08:44] apergos yeah i think there is a bug that if you push an updated change it shows the previous change name. [17:09:00] (03CR) 10BryanDavis: "Looks to be causing:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [17:10:40] <_joe_> bd808: heh, you're right [17:10:45] !log starting branch cut for 1.28.0-wmf.12 [17:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:03] <_joe_> bd808: can that wait until tomorrow morning my time? [17:11:04] (03PS1) 10ArielGlenn: get rid of the useless snapshot cron role wrapper [puppet] - 10https://gerrit.wikimedia.org/r/301167 [17:11:16] <_joe_> on a second puppet run it should fix itself, right? [17:11:30] nope. it's permanently hosed [17:11:37] <_joe_> oh, dear [17:11:42] <_joe_> yeah let me fix it [17:11:43] (03CR) 10Chad: [C: 031] Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:11:45] because the initial conversion breaks part way through [17:11:48] I've got a patch [17:11:58] <_joe_> oh, were is it? [17:12:03] <_joe_> *where [17:12:16] In my edit buffer :) [17:12:20] (03CR) 10Dzahn: [C: 031] "yea, i think it could be from long time ago like you say...as long as you are sure it's not used in Labs" [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:12:23] <_joe_> AHAH OK [17:12:30] <_joe_> I'll be back in a few then [17:12:31] !log finished deploying parsoid version 285b6983 [17:12:34] time to verify .. [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:36] <_joe_> (was calling it a day) [17:13:17] (03PS1) 10BryanDavis: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 [17:13:29] (03CR) 10Dzahn: [C: 032] Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:13:36] (03PS8) 10Dzahn: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:13:43] mutante ^^ thanks :) :) [17:14:54] mutante https://gerrit.wikimedia.org/r/#/c/301001/8 will need to be c+2 again please.
[17:15:02] Since you rebased it after doing c+2 [17:15:26] (03CR) 10ArielGlenn: [C: 032] get rid of the useless snapshot cron role wrapper [puppet] - 10https://gerrit.wikimedia.org/r/301167 (owner: 10ArielGlenn) [17:15:43] i know, just have to wait a moment [17:15:50] YuviPanda: want to help unbreak all new self-hosted puppetmasters in Labs? https://gerrit.wikimedia.org/r/#/c/301168/ [17:15:52] (03PS9) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [17:16:07] ok [17:16:08] sorry [17:16:24] (03CR) 10Chad: "So I was looking at mysql-connector-java but I can't seem to find it." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [17:16:35] paladox: what was PS9 ? [17:16:47] A rebase [17:16:51] it showed merge conflict [17:16:57] due to it being fast forward [17:17:02] paladox: look at what PS8 was [17:17:09] Yeh [17:17:12] but showed it again [17:17:18] hi bd808 [17:17:21] due to something being merged after you rebased [17:17:27] it's not making it faster to add more PS [17:17:33] while waiting for the bot [17:17:36] (03PS2) 10Yuvipanda: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:17:37] Sorry [17:17:53] np, it will be live shortly [17:18:00] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:18:03] Ok [17:18:25] (03PS1) 10ArielGlenn: include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 [17:18:27] mutante I'm going to wait for ostriches to chime in about modify-ldap-groups and stuff since he also does these things and then merge [17:18:46] paladox: now it's merged. thing is that action grrrit-wm did not talk about [17:18:52] fucking gerrit with a fucking different submit fucking button aaarfgghh [17:18:57] YuviPanda: that sounds good , yes [17:19:00] Oh [17:19:06] (03CR) 10Chad: [C: 031] "Fine by me, didn't use it anyway." [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:19:12] (03CR) 10Chad: [C: 031] "Fine by me, didn't use it anyway." [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:19:14] it's like the anti-honeymoon-period all over again [17:19:16] * bd808 hugs YuviPanda and tells _joe_ to head off to bed/life [17:19:31] (03PS3) 10Yuvipanda: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:19:32] Lol it showed the change twice ^^ [17:19:33] (03CR) 10Yuvipanda: [V: 032] Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:19:49] bd808 yw [17:19:55] bd808 I've merged it now [17:20:12] (03PS8) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) [17:20:15] I'll nuke my busted instance and try again :) [17:20:16] I mean I'll have to figure out how to use this ldapvi thing, but I'll live. 
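Chad's "can't seem to find it" above is the usual source-versus-binary package name trap (Muehlenhoff spells out the answer just below), and apt can untangle it directly — a sketch on a Debian host, assuming deb-src entries are configured:

    # mysql-connector-java is a source package; list what it builds
    apt-cache showsrc mysql-connector-java | grep '^Binary:'
    # the installable binary package is libmysql-java (simulate only)
    apt-get install -s libmysql-java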
[17:20:22] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:20:25] (don't let my fear of new things stop you :p) [17:20:48] (03CR) 10Chad: [C: 031] "sure why not" [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:20:59] (03CR) 10Chad: [C: 031] "move fast and break things?" [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:21:24] ostriches I wrote up https://wikitech.wikimedia.org/wiki/LDAP#Common_LDAP_administrative_actions [17:21:28] k [17:21:35] I'm trying to move fast, gerrit won't let me [17:21:48] (03PS4) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) [17:21:56] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:06] (03PS3) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) [17:22:11] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:18] (03PS4) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [17:22:24] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:41] (03PS4) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [17:22:46] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:23:15] done! [17:23:23] thank you, mutante / ostriches / krenair [17:23:53] (03PS2) 10ArielGlenn: include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 [17:26:25] (03CR) 10ArielGlenn: [C: 032] include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 (owner: 10ArielGlenn) [17:29:50] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:30:28] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:12] ostriches i now know what .commentPanelMessage did, it's css class name has changed in gerrit 2.12. [17:36:17] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:37:41] (03CR) 10Yuvipanda: [C: 04-1] "I *hate* ldaplist (and everything that uses the stupid ldapsupportlib, in fact). I've https://phabricator.wikimedia.org/T114063 open to ki" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:38:00] (03CR) 10BryanDavis: "Aargh. This actually breaks things worse. 
Now we have:" [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:39:54] (03PS1) 10Paladox: Fix gerrit's css class .commentPanelMessage [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:40:29] (03PS2) 10Paladox: Gerrit: fix gerrit's css class .commentPanelMessage [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:40:37] ostriches ^^ [17:41:00] (03PS1) 10Eevans: Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) [17:42:42] (03PS3) 10Paladox: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:43:17] (03CR) 10Muehlenhoff: "The name of the binary package is libmysql-java, mysql-connector-java is the source package name" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [17:46:11] (03CR) 10Eevans: [C: 04-1] "Disregard, just queuing this up for now; I will signal readiness with a +1 when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [17:46:19] (03PS1) 10BryanDavis: role::puppet::self: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/301175 [17:47:41] 06Operations, 06Services, 10Wikimedia-Logstash: New Kibana dashboards timing out consistently - https://phabricator.wikimedia.org/T141384#2496381 (10GWicke) [17:48:07] 06Operations, 06Services, 10Wikimedia-Logstash: New Kibana dashboards timing out consistently - https://phabricator.wikimedia.org/T141384#2496393 (10GWicke) p:05Triage>03High [17:48:41] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496381 (10GWicke) [17:49:03] (03PS1) 10Eevans: Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) [17:49:42] (03PS1) 10Yuvipanda: Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 [17:49:45] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2496402 (10Cmjohnson) 05Open>03Resolved I received the 7 switches, the box used was too big and 4 switches must have rolled around several times inside. The box when it arrived is badly... [17:50:41] (03CR) 10Eevans: [C: 04-1] "Disregard, just queuing this up for now; I will signal readiness with a +1 when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [17:52:00] (03PS1) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 [17:52:06] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2496417 (10faidon) OK, did a little more investigation. pfw-eqiad is a cluster of two SRX650s, each with 4x1Gbps built-in ports, 16x1Gbps in a GPIM and 2x10G in an XPIM. The SRX platform in a c... [17:52:09] andrewbogott ^ and the DNS change for misc-varnish! [17:52:26] (03CR) 10Andrew Bogott: "I think ldaplist is a useful tool. 
Just because it uses a library with bugs in it doens't mean that everything is poisoned start to finis" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:53:03] (03CR) 10Yuvipanda: "What are the concrete cases for ldaplist? If those are identified and listed I'll happily rewrite it to not suck." [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:53:25] (03CR) 10jenkins-bot: [V: 04-1] cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 (owner: 10Yuvipanda) [17:53:37] YuviPanda: Thanks! But the thing I was saying before is still true… that means it'll hit the service on http rather than https, and it doesn't work on http [17:53:41] so I still have that problem [17:54:01] andrewbogott oooh, it doesn't respect X-Forwarded-Proto [17:54:02] ? [17:54:17] right, then that's much more complicated then. [17:54:27] (03PS2) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 [17:55:43] YuviPanda: I don't know… just that there's no config option to tell it whether to talk https or http. and if I point to it with an http url it says 'Error: Unexpected protocol mismatch. [17:56:11] right. so if it respects X-Forwarded-Proto that sohuld work with misc-varnish [17:56:12] but we can test that I think [17:56:18] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496429 (10bd808) I think this has something to do with the search term highlighting that kibana4 does server side. I'll poke at it a bit and s... [18:00:53] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2496454 (10Cmjohnson) [18:00:55] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2496452 (10Cmjohnson) 05Open>03Resolved I received the disk and replaced it root@ms-be1022:~# hpssacli ctrl slot=3 ld all show status logicaldrive 1 (186.3 GB, 0): OK logica... [18:01:11] 06Operations, 10ArchCom-RfC, 06Services, 07Archcom-has-shepherd, 07RfC: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#2496458 (10RobLa-WMF) [18:02:04] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496461 (10GWicke) Thanks, @bd808! [18:03:05] (03PS1) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:03:37] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.113:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.113, port=9200): Read timed out. (read timeout=4) [18:05:21] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2496465 (10Cmjohnson) I requested 2 SSD"s to be sent and the confirmation email states 2 SSD's but they actually sent me 2 4TB HDD's instead. A call to them has to take place. 
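The protocol-mismatch debugging above can bypass varnish entirely — a sketch; the backend host and port are placeholders, since the spice service's actual listener isn't named in the log:

    # if the app honors X-Forwarded-Proto, claiming HTTPS termination here
    # should make the 'Unexpected protocol mismatch' error go away
    curl -si -H 'X-Forwarded-Proto: https' 'http://BACKEND_HOST:PORT/' | head -n 1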
[18:07:04] !log Restarted elasticsearch on logstash1003, couldn't find master [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:29] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards [18:08:06] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496490 (10bd808) From [[https://www.elastic.co/guide/en/kibana/current/advanced-options.html|upstream docs]]: > **doc_table:highlight** > High... [18:08:10] thcipriani: and also, I would still be around after the train if you would have the time! [18:09:20] addshore: kk, I'll ping you when I'm done Train-ing. [18:12:34] varnish is hating kibana4 right now :/ [18:13:51] bd808: in what way? [18:14:45] I think the kibana4 nodes are timing out on some queries and then varnish is marking the node as offline for a few minutes [18:15:16] and they we run out of nodes and get a "service is dead" type response from varnish [18:15:24] I don't think varnish is actually doing anything wrong here [18:15:29] (03PS2) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:15:32] just bad alignment of stars [18:15:56] also I continue to hate kibana4 :) [18:16:28] adding a node proxy into the mix does not make it more stable or performant as far as I can tell [18:19:08] bd808: Something something, onions and layers :p [18:19:37] bad alignment of tors [18:19:55] YuviPanda: I really broke self-hosted puppet with that patch I tricked you into merging. I think that https://gerrit.wikimedia.org/r/#/c/301175/1 will unbreak both the new and old breakage [18:20:43] (03CR) 10Yuvipanda: [C: 032] role::puppet::self: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/301175 (owner: 10BryanDavis) [18:20:55] bd808 done [18:21:04] (03PS2) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) [18:21:11] (03CR) 10Addshore: [C: 032] Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [18:21:36] (03Merged) 10jenkins-bot: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [18:22:03] (03CR) 10ArielGlenn: [C: 032] provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 (owner: 10ArielGlenn) [18:22:13] (03PS3) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:25:18] (03CR) 10Dzahn: [C: 031] Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [18:25:45] YuviPanda: thanks [18:26:58] (03CR) 10Chad: "Ah ok, makes sense. 
I was looking at libbcprov-java too, but it looks like it might be too dated for our use :\" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [18:28:04] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Enable RevisionSlider on mediawikiwiki {{gerrit|301105}} (duration: 01m 28s) [18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:16] (03PS1) 10ArielGlenn: use fixed repodir setting for dumps jobs [puppet] - 10https://gerrit.wikimedia.org/r/301181 [18:31:10] (03PS4) 10Dzahn: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:32:22] (03CR) 10Dzahn: [C: 032] "very similar to the issue we merged earlier, but for the commit-message. noticed that issue with the scrollbar too" [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:33:28] (03CR) 10ArielGlenn: [C: 032] use fixed repodir setting for dumps jobs [puppet] - 10https://gerrit.wikimedia.org/r/301181 (owner: 10ArielGlenn) [18:33:52] (03CR) 10BryanDavis: "This breaks the dependency cycle, but on the one test host I built with it (striker-deploy01.striker.eqiad.wmflabs) the initial puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/301175 (owner: 10BryanDavis) [18:34:18] (03PS5) 10Dzahn: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:36:23] !log addshore@tin Synchronized php-1.28.0-wmf.11/extensions/WikimediaEvents/WikimediaEventsHooks.php: dewiki_diffstats add rev timestamps & feature state {{gerrit|301119}} (duration: 00m 28s) [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:51] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2496549 (10Jgreen) 05Open>03Resolved [18:37:19] thcipriani: nothing exploded ;) [18:37:43] (03CR) 10Dzahn: [V: 032] "was tested on http://gerrit-test.wmflabs.org/gerrit/#/c/16/" [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:37:45] addshore: \o/ kudos on the first successful "swat"-ish :) [18:39:25] mutante ^^ thanks [18:40:39] commit-msg should now be pre-wrapped again in gerrit :) [18:40:44] paladox: applied [18:40:46] ostriches ^^ [18:40:47] thanks [18:41:30] (03CR) 10Dzahn: "yes, should be limited to eventbus things. if kafka things are needed in the future that should probably be kafka-admins or so" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [18:42:18] (03CR) 10Dzahn: "i think the * in the sudo line means that it actually allows controlling any service. 
better to avoid that wildcard and actually list the " [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [18:47:03] (03PS1) 10ArielGlenn: add notes about manual additions needed for new snapshot nodes [puppet] - 10https://gerrit.wikimedia.org/r/301182 [18:49:24] (03CR) 10ArielGlenn: [C: 032] add notes about manual additions needed for new snapshot nodes [puppet] - 10https://gerrit.wikimedia.org/r/301182 (owner: 10ArielGlenn) [18:58:39] bd808: we can alter the timeouts for varnish->kibana, too [18:58:58] bd808: there's separate timeouts for first connect, first byte, idle between response bytes, etc [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1900). [19:00:16] * thcipriani does [19:02:57] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 [19:07:26] !log thcipriani@tin Purged l10n cache for 1.28.0-wmf.10 [19:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:10] * Reedy wonders which further extensions can be moved to extenson.json in extension-list after todays deploy [19:09:19] !log thcipriani@tin Started scap: testwiki to php-1.28.0-wmf.12 and rebuild l10n cache [19:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:49] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496627 (10EBernhardson) This also looks like it could be related to the mapping that is generated for restbase: ``` ebernhardson@logstash1001... [19:12:31] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496642 (10bd808) Timeouts still seem to happen (and I still see highlighted terms on other dashboards that load). More strangeness: * [[https... [19:12:52] ebernhardson: jinx! I think we came to the same conclusion [19:13:15] the records out of restbase are a bit too structured [19:15:20] bd808: :) [19:16:42] i had noticed that before and wondered as well, but it didn't seem to be causing issues so didn't think about it till now [19:16:54] (03PS1) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:17:17] (03CR) 10jenkins-bot: [V: 04-1] Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [19:18:39] (03PS2) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:18:41] (03PS2) 10Yuvipanda: python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 [19:19:39] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496670 (10GWicke) > This also looks like it could be related to the mapping that is generated for restbase Has this mapping changed recently,... 
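The separate varnish timeouts mentioned above (first connect, first byte, idle between response bytes) have runtime-tunable global defaults — a sketch; the value is illustrative, and per-backend overrides would live in the VCL backend definition instead:

    varnishadm param.show connect_timeout
    varnishadm param.show first_byte_timeout
    # raise the idle-between-response-bytes default to 60s
    varnishadm param.set between_bytes_timeout 60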
[19:22:35] (03CR) 10Dzahn: [C: 04-1] Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [19:23:02] (03CR) 10Dzahn: "yep, please remove the * and list the actual subcommands" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [19:31:27] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on port 9042 [19:32:03] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496712 (10GWicke) > My new theory about why kibana4 and restbase aren't getting along is the incredibly high cardinality of fields in the rest... [19:33:10] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1009.eqiad.wmnet) [19:33:11] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:33] (03PS3) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:33:40] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-a.eqiad.wmnet) [19:33:41] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:33:42] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:48] !log thcipriani@tin Finished scap: testwiki to php-1.28.0-wmf.12 and rebuild l10n cache (duration: 25m 29s) [19:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:41] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-b.eqiad.wmnet) [19:37:43] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:37:43] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:44] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 (owner: 10Thcipriani) [19:38:57] (03PS5) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:39:10] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 (owner: 10Thcipriani) [19:40:31] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.12 [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:27] (03PS6) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:42:50] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1014.eqiad.wmnet) [19:42:51] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:05] !log T134016, 
T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-a.eqiad.wmnet) [19:43:06] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:43:07] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:01] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496746 (10GWicke) The best bet for the source of those `err_*` keys I have so far is https://github.com/wikimedia/hyperswitch/blob/a884a7c1afe... [19:49:36] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-b.eqiad.wmnet) [19:49:38] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:49:38] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:45] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1015.eqiad.wmnet) [19:53:46] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:06] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-a.eqiad.wmnet) [19:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:18] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:54:18] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:57:15] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496801 (10bd808) >>! In T141384#2496712, @GWicke wrote: >> My new theory about why kibana4 and restbase aren't getting along is the incredibly... [19:57:28] (03CR) 10Brian Wolff: "Note, this will change the default thumbnail size fetched for galleries, which could cause a spike in requests to thumbnail servers. Given" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [19:58:40] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-b.eqiad.wmnet) [19:58:42] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:58:42] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:57] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
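One way to put a number on the field-cardinality theory in T141384 above is to count mapped fields in a day's index — a rough sketch from a logstash host; the index name and host are assumptions:

    # every mapped leaf field carries a "type" entry; crude but telling
    curl -s 'http://localhost:9200/logstash-2016.07.26/_mapping' | grep -o '"type"' | wc -l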
[19:59:53] (03CR) 10BryanDavis: [C: 032] python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 (owner: 10Yuvipanda) [20:01:28] (03Merged) 10jenkins-bot: python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 (owner: 10Yuvipanda) [20:01:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2496812 (10Gehel) It would probably be easier for @Jonas to have direct access to the nginx logs on the wdqs servers. I'm not familiar to how we handle that (... [20:04:14] (03CR) 10Eevans: [C: 031] "I'm ready for this to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:04:40] mutante: I have a couple of Cassandra instances to do, can you help? [20:04:53] (03PS2) 10Eevans: Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) [20:05:26] (03CR) 10BryanDavis: [C: 04-1] Fix generic webservices (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [20:06:14] urandom: yes [20:06:51] mutante: cool, https://gerrit.wikimedia.org/r/301176 should be first, and then once i have that running, https://gerrit.wikimedia.org/r/#/c/301174 [20:10:52] ok, sec [20:10:56] urandom: on it now [20:11:00] mutante: kk [20:11:34] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:12:06] urandom: first one is active now [20:12:17] mutante: thanks [20:15:19] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496842 (10Pchelolo) Actually the reason of these `err` keys is that sometimes we log the full request/responce. If it's an object with randomi... [20:16:28] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/301192 [20:23:57] !log T134016: Bootstrapping restbase1009-c.eqiad.wmnet [20:23:59] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:02] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2496878 (10Dzahn) Would it be ok with everyone here if we confirm the grafana part works, close this ticket as resolved (since it was all about NDA) and open... [20:26:47] (03CR) 10Eevans: [C: 031] "Ready." [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:26:51] (03PS2) 10Eevans: Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) [20:27:13] mutante: ready for https://gerrit.wikimedia.org/r/#/c/301174/ whenever you are. [20:27:21] mutante: no rush btw. 
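Bootstraps like the restbase1009-c one just logged are usually babysat with nodetool — a sketch; the per-instance wrapper name (nodetool-c) matches the instance naming above but should be treated as an assumption:

    # UJ = Up/Joining; the line disappears once the join completes
    nodetool-c status | grep -w UJ
    # active streaming sessions and bytes transferred for the bootstrap
    nodetool-c netstats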
[20:27:57] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:28:22] urandom: no problem, done [20:28:24] (03PS7) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [20:28:32] mutante: awesome, thanks! [20:28:41] urandom: i like that how you prepare them and then mark +1 when it's time [20:29:17] mutante: ok, cool; i'll be sure to keep doing that [20:29:23] (03CR) 10Yuvipanda: Fix generic webservices (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [20:29:36] bd808 ^ fixed Tue, but no need for .extend [20:33:57] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496912 (10EBernhardson) >>! In T141384#2496670, @GWicke wrote: >> This also looks like it could be related to the mapping that is generated fo... [20:36:14] thcipriani: o/ MW train done? mind if i do a quick mobileapps deployment? [20:36:44] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused [20:36:50] got it ^^ [20:37:09] !log Bootstrapping restbase2005-c.eqiad.wmnet [20:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:53] mdholloway: yup, train is done, all yours [20:37:54] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-29 20:37:34. [20:38:08] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused daniel_zahn . [20:38:13] thcipriani: great, thx! [20:38:33] (03PS9) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [20:39:28] (03CR) 10ArielGlenn: "updated this patch after move of misc crons to snapshot1007 and refactor of cron job classes" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [20:40:26] YuviPanda: your reset-ldap-password change has a problem on terbium [20:41:12] mutante uh, what is the issue? [20:41:18] also can you file a bug? I"m about to go for lunch just now [20:41:47] !log starting mobileapps deployment [20:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:02] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496951 (10EBernhardson) >>! In T141384#2496842, @Pchelolo wrote: > Actually the reason of these `err` keys is that sometimes we log the full r... [20:42:09] YuviPanda: sure, you got a bug when you get back :) [20:42:21] Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/ldap/reset-password [20:43:28] (03PS1) 10Yuvipanda: Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 [20:43:40] (03PS1) 10Yuvipanda: ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 [20:43:42] mutante ah, ^ should fix that. [20:43:44] can you merge? 
[20:43:47] (the puppet change) [20:43:49] looking [20:43:55] !log mobileapps deployed fd3f33b [20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:09] oh! sure [20:44:41] (03CR) 10Dzahn: [C: 032] ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 (owner: 10Yuvipanda) [20:45:38] oh, first time i see "submit including parents" [20:45:57] (03PS2) 10Dzahn: ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 (owner: 10Yuvipanda) [20:45:57] sounds dangerous [20:46:01] yes [20:49:00] hmm... do we have to run any script to update the favicon of a project? [20:49:14] (03CR) 10BryanDavis: [C: 032] Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 (owner: 10Yuvipanda) [20:49:32] (03Merged) 10jenkins-bot: Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 (owner: 10Yuvipanda) [20:50:35] hashar: thanks for working on CI integration for operations/debs, but please exclude "linux" and "linux44", these are really massive builds and it only makes sense to trigger a test build manually [20:51:24] it's even likely that they deplete disk space on the build VMs entirely... [20:52:45] mafk: it's created by favicon.php i think [20:53:24] I updated mk.wiktionary favicon on today's morning SWAT and while the patch is merged, favicon is still the old one mutante [20:54:02] sync-common-file? [20:58:57] (03PS1) 10Reedy: 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) [20:59:33] (03CR) 10Hashar: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/301192 (owner: 10Hashar) [21:00:06] mafk: i dont know. maybe the SWATter person odes [21:00:08] does [21:01:09] mutante: how would you transfer data between an arbitrary eqiad host and a codfw host? [21:01:39] netcat + openssl ? [21:01:51] hashar: what about resuming a broken transfer? [21:01:55] urandom: if it's small and not secret, i would use scp to localhost, scp -3 [21:02:05] mutante: it's ~900G [21:02:09] rsync!!!!!! [21:02:14] hashar: how? [21:02:19] hashar: over ssh? [21:02:20] urandom: i let puppet setup rsyncd [21:02:31] urandom: no, over rsync protocol to rsycnd [21:02:33] no need for rsyncd [21:02:38] if you get ssh [21:02:43] mutante: OK, yeah, someone else suggested that [21:02:45] you dont get ssh [21:03:01] urandom: you can copy it from an existing class.. hold on [21:03:04] mutante: do you know of a good example to crib from the repo? 
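Of the transfer options floated above, rsync is the one that answers the resume question: --partial keeps half-transferred files so a broken ~900G copy can pick up where it left off. A client-side sketch for both variants mutante mentions; hosts, paths and the module name are placeholders:

    # over ssh, where permitted
    rsync -av --partial --progress /srv/data/ dest.codfw.wmnet:/srv/data/
    # over the rsync protocol, against a puppet-managed rsyncd module (see below)
    rsync -av --partial --progress /srv/data/ rsync://dest.codfw.wmnet/MODULE/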
[21:03:32] (03CR) 10Reedy: [C: 032] 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [21:03:51] urandom: modules/role/manifests/lists/migration.pp for example [21:03:58] (03Merged) 10jenkins-bot: 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [21:04:44] urandom: so you put a role like that on the target server first, and adjust the source IP in there [21:05:10] mutante: nice; and ferm too i assume [21:05:19] !log reedy@tin Synchronized wmf-config/extension-list: moar extension.json (duration: 00m 33s) [21:05:21] urandom: that will add rsyncd and ferm rules and with rsync::server::module you add one or more modules [21:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:28] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:07:05] urandom: just that if you remove that class again, puppet will not clean up. gotta remember to manually kill rsyncd and the configs when done [21:07:19] :/ [21:07:36] yeah, this is temporary... [21:07:55] is it a thing where you also have to copy the data back in the other direction ? [21:07:56] and it needs to be added to only a small number of hosts (3) [21:08:09] mutante: some day [21:08:11] in that case i edit the IP and move the role to another node [21:08:57] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused [21:09:02] got it ^^ [21:09:14] (03PS1) 10Chad: WIP: Bring over php::ini from MediaWiki vagrant [puppet] - 10https://gerrit.wikimedia.org/r/301285 [21:09:38] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-27 21:09:22. [21:09:39] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused daniel_zahn . [21:14:20] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2497030 (10GWicke) > Perhaps there is something similar that can be handled in node to filter the output? Yes, we can definitely sanitize this... [21:15:55] (03CR) 10BryanDavis: [C: 032] Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [21:17:18] (03Merged) 10jenkins-bot: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [21:19:38] 06Operations, 10Cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (10GWicke) Another option for large-scale Cassandra testing that we can pursue in parallel is using cloud infrastructure like GCE. A recent demo showed about [1000 Ca... [21:21:40] (03CR) 10Chad: "Passed puppet compiler!
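(The pattern mutante points at above reduces to: the receiving host gets rsyncd, one exported rsync "module" restricted to the sending host's IP, and a ferm rule opening port 873 to that IP. A rough sketch of such a throwaway role, modeled on modules/role/manifests/lists/migration.pp; the class name, path and source IP are invented, and the exact parameters of the repo's rsync defines may differ:)

```puppet
# Hypothetical one-off migration role; names, paths and the source IP
# are placeholders, not real production values.
class role::cassandra_migration {

    $source_ip = '10.64.0.1'  # the host the data will be pushed from

    include rsync::server      # installs and runs rsyncd

    # One named, exported path (a "module" in rsync terms)
    rsync::server::module { 'cassandra-data':
        path        => '/srv/cassandra',
        read_only   => 'no',
        hosts_allow => [$source_ip],
    }

    # Open the rsync port to the source host only
    ferm::service { 'cassandra-migration-rsync':
        proto  => 'tcp',
        port   => '873',
        srange => "${source_ip}/32",
    }
    # Note mutante's caveat above: removing this class later does not
    # clean up; rsyncd and its config must be stopped/removed by hand.
}
```

(From the source host the push would then look roughly like rsync -av /srv/cassandra/ rsync://target.codfw.wmnet/cassandra-data/, and re-running it resumes an interrupted transfer.)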
https://puppet-compiler.wmflabs.org/3491/lead.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/300930 (owner: 10Chad) [21:22:21] (03PS2) 10Chad: Gerrit: Remove all the junk to support 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/300930 [21:26:06] (03PS1) 10Dzahn: ldap: fix typo for reset-password script name [puppet] - 10https://gerrit.wikimedia.org/r/301288 [21:27:11] (03CR) 10Dzahn: [C: 032] ldap: fix typo for reset-password script name [puppet] - 10https://gerrit.wikimedia.org/r/301288 (owner: 10Dzahn) [21:29:47] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:44:06] (03PS1) 10Andrew Bogott: Set up spice-based remote consoles for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) [21:47:07] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497082 (10ellery) @BBlack, @Nuria In order to run a randomized controlled experiment, you need to ensure that users are randomly assigned to treatment conditions at the start... [21:50:23] ostriches, i spoke with qchris_; they seem to now be prioritising gerrit's ui, going with the new ui, which i presume is polygerrit [21:50:28] (03PS2) 10Andrew Bogott: Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 (https://phabricator.wikimedia.org/T141399) (owner: 10Yuvipanda) [21:50:35] meaning they will be less likely to fix the current ui [21:50:40] since there will be no point [21:50:43] mutante ^^ [21:51:12] RoanKattouw, seems the only way the diff problem you had will get fixed is when they fix it in the new ui [21:56:56] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2497094 (10GWicke) A PR removing the full request body from storage backend errors is now available at https://github.com/wikimedia/restbase-mo... [22:10:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [22:10:57] (03PS1) 10Dzahn: restbase-test: setup rsync for data from cassandra-test [puppet] - 10https://gerrit.wikimedia.org/r/301303 [22:11:08] paladox: yep, ok [22:11:21] mutante not sure why you're saying yep ok. [22:12:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:12:36] paladox: well, you pinged me [22:12:46] Oh [22:12:55] and i saw what you were pointing to, but i dont have much to add right now [22:12:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [22:13:42] ok [22:14:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5065553 keys - replication_delay is 0 [22:27:37] (03CR) 10Jforrester: [C: 04-1] "Let's not be hasty."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [22:31:34] (03PS1) 10Ppchelko: Change-Prop: Increase maximum concurrency for ORES [puppet] - 10https://gerrit.wikimedia.org/r/301305 [22:32:49] 06Operations, 10Analytics: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2497199 (10Peachey88) [22:37:34] 06Operations, 10MediaWiki-Releasing, 10Parsoid, 06Release-Engineering-Team: debian signing keyid E84AFDD2 has expired - https://phabricator.wikimedia.org/T141400#2497211 (10greg) [22:44:11] (03PS1) 10BBlack: Revert "ssl_ciphersuite: drop non-FS AES256 options" [puppet] - 10https://gerrit.wikimedia.org/r/301307 [22:44:32] (03CR) 10BBlack: [C: 032 V: 032] Revert "ssl_ciphersuite: drop non-FS AES256 options" [puppet] - 10https://gerrit.wikimedia.org/r/301307 (owner: 10BBlack) [22:44:49] (03CR) 10Ladsgroup: [C: 031] Change-Prop: Increase maximum concurrency for ORES [puppet] - 10https://gerrit.wikimedia.org/r/301305 (owner: 10Ppchelko) [22:48:42] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2497236 (10GWicke) >>! In T136957#2491863, @MoritzMuehlenhoff wrote: >> - stdout is apparently ignored & does not make it into the systemd journal. >... [22:54:30] 06Operations, 10MediaWiki-Releasing, 10Parsoid, 06Release-Engineering-Team: debian signing keyid E84AFDD2 has expired - https://phabricator.wikimedia.org/T141400#2497264 (10greg) We should probably update https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org (or add a new page and link to it from [[Re... [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T2300). [23:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:38] Heya. [23:00:41] Is it just me? [23:00:54] Got someone to deploy it? [23:00:58] Nope. [23:01:05] * Reedy looks [23:01:11] well, if I can re-schedule a couple of patches from morning swat reedy? [23:01:14] It's "just" a de-deployment. :-) [23:01:27] (03PS3) 10Reedy: De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:01:28] (window time went off due to elasticsearch failures) [23:01:54] mafk: I can do some if you need [23:02:09] (03CR) 10Reedy: [C: 032] De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:02:12] Reedy: thanks, let me find the gerrit links [23:02:28] Reedy: I guess we should sync Common, then Init*, then extension-list? [23:02:35] (03Merged) 10jenkins-bot: De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:02:48] extension-list is mostly a no-op [23:02:51] Sure. 
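(T141400 above is the routine failure mode for an apt repository whose signing key has passed its expiry date: clients start rejecting the repo's signatures. Checking, and extending the expiry if the secret key is still at hand, looks roughly like the sketch below; exact prompts vary by GnuPG version, and the repository's Release files would still need re-signing afterwards:)

```bash
# Check whether a signing key has expired (prints "[expired: ...]" if so)
gpg --list-keys E84AFDD2

# With the secret key available, extend the expiry interactively:
#   gpg> expire   (choose a new validity period)
#   gpg> save
gpg --edit-key E84AFDD2
```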
[23:02:53] but yeah, out of CS first [23:03:01] https://gerrit.wikimedia.org/r/#/c/300758/ [23:03:17] and https://gerrit.wikimedia.org/r/#/c/300880/ [23:04:05] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Undeploy ImageMetrics (duration: 00m 27s) [23:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:31] looks to be gone from Special:Version [23:04:54] Reedy: On 1099? [23:05:11] Everywhere! [23:05:48] Gah. [23:05:49] Tut. [23:05:50] OK. [23:06:15] Sync the rest then. :-) [23:06:35] (03PS1) 10GWicke: Service::node: Capture stdout and stderr in journal [puppet] - 10https://gerrit.wikimedia.org/r/301309 (https://phabricator.wikimedia.org/T136957) [23:06:59] !log reedy@tin Synchronized wmf-config/: Remove rest of ImageMetrics config (duration: 00m 33s) [23:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:43] (03PS5) 10Reedy: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:07:53] (03CR) 10Reedy: [C: 032] Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:08:18] (03Merged) 10jenkins-bot: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:08:39] error: insufficient permission for adding an object to repository database /srv/mediawiki-staging/.git/modules/wmf-config/event-schemas/objects [23:08:40] ffs [23:08:55] WHY are numerous root:root [23:09:18] Reedy: Probably done in a hurry by o.ri? [23:09:46] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497291 (10BBlack) >>! In T135762#2497082, @ellery wrote: > As far as I can tell, the proposed method also violates the more important property that users need to be randomly a... [23:09:58] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:17] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:58] * Reedy goes to look for a root [23:11:53] perhaps scap needs something that blows up when files are owned by incorrect user? Although perhaps that's out of scope... [23:11:57] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:16] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Puppet has 2 failures [23:12:17] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:18] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:26] oh no, not again... 
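(Reedy's "insufficient permission" failure above is what git prints when objects under /srv/mediawiki-staging/.git are owned by the wrong user, here root:root. The pre-flight check he suggests could look roughly like this; the expected owner "mwdeploy" is an assumption, substitute whatever user the deploy tooling actually writes as:)

```bash
# Hypothetical pre-flight ownership check for the staging clone.
# Fails loudly if anything under .git is not owned by the deploy user.
STAGING=/srv/mediawiki-staging
BADLY_OWNED=$(find "$STAGING/.git" ! -user mwdeploy -print | head -n 20)
if [ -n "$BADLY_OWNED" ]; then
    echo "Refusing to sync; wrongly-owned files under $STAGING:" >&2
    echo "$BADLY_OWNED" >&2
    exit 1
fi
```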
[23:13:37] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:06] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:07] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:55] looks like some kind of 500 server error from strontium [23:15:06] usual suspect then [23:15:11] so probably related to recent troubles I haven't really been following with puppetmasters falling over [23:18:08] !log reedy@tin Synchronized wmf-config/event-schemas: Bump event-schemas submodule commit to master (duration: 00m 28s) [23:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:25] mafk: ^ not sure if there's anything for you to obviously test for that [23:18:51] Reedy: well, I can do a git submodule update to test it updates to the latest commit [23:19:05] I mean, from the web [23:19:15] Submodule path 'wmf-config/event-schemas': checked out '4db9d40d28d61c53cdbca77059d9a2a6e714af89' [23:19:18] I don't think so [23:19:29] https://github.com/wikimedia/mediawiki-event-schemas/commit/4db9d40d28d61c53cdbca77059d9a2a6e714af89 [23:19:33] commit hashes match [23:19:34] WFM [23:19:47] yea i doubt anything about event-schemas is visible on the web, it would be seen either in eventbus or the monolog->kafka pipeline (depending on what changed) [23:20:01] (03PS8) 10Reedy: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:20:17] now people will have more updated files when cloning the mediawiki-config :) [23:22:57] (03CR) 10Reedy: [C: 032] Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:22:58] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures [23:23:30] (03Merged) 10jenkins-bot: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:24:07] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:27] RECOVERY - puppet last run on cp1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:25:32] !log reedy@tin Synchronized dblists/commonsuploads.dblist: Disabling local uploads on ms.wikipedia.org (duration: 00m 23s) [23:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:53] mw1099? 
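(The check mafk performs above is the standard way to verify a submodule bump: update the submodule pointer in a clone and confirm the checked-out hash matches the commit the merged change pins. As a sketch, with the clone path a placeholder:)

```bash
# Verify the event-schemas submodule bump from a mediawiki-config clone
# (clone location is a placeholder).
cd ~/mediawiki-config
git pull
git submodule update --init wmf-config/event-schemas
git -C wmf-config/event-schemas rev-parse HEAD
# Expected: 4db9d40d28d61c53cdbca77059d9a2a6e714af89, matching the
# github commit quoted above.
```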
[23:26:20] It's everywhere [23:26:54] ok, testing [23:27:33] WFM, mswiki's Special:ListGroupRights shows that upload rights have been removed from non-sysops, and upload link in the sidebar now points to commons UploadWizard [23:27:40] thank you Sir [23:30:08] Nemo_bis: that mswiki upload restriction is now done [23:34:27] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:34:36] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:36:46] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:36:47] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:36:56] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:36:56] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:38:16] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:39:24] Getting on every page mediawiki.org in the console for load.php [23:39:25] load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector:4 [V5ftWgpAAD8AAXroki4AAABG] 2016-07-26 23:08:10: Fatal exception of type "BadMethodCallException" [23:40:24] Still happening yp, for about 10 minutes now [23:40:25] load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector:4 [V5f0cgpAADoAAYz5SAcAAACH] 2016-07-26 23:38:26: Fatal exception of type "BadMethodCallException" [23:40:42] Reedy: ^ [23:41:26] * Reedy looks on fluorine [23:42:13] Reedy: https://logstash.wikimedia.org/app/kibana?#/doc/logstash-*/logstash-2016.07.26/mediawiki/?id=AVYpk6AVJW_HnhxrzSJX [23:42:29] Isn't that more likely .12 related than SWAT? [23:43:14] "Sessions are disabled for this entry point" [23:43:18] (03CR) 10Eevans: restbase-test: setup rsync for data from cassandra-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301303 (owner: 10Dzahn) [23:43:19] thanks ebernhardson [23:43:33] > Kibana is loading. Give me a moment here. I'm loading a whole bunch of code. Don't worry, all this good stuff will be cached up for next time! [23:43:50] heh [23:44:04] Krinkle: I think you need to poke tgr or anomie as it looks session change related [23:44:12] Possibly only in .12... Depends if/what has changed recently [23:44:15] (Sorry these urls are so ugly...), but yes looks like strictly wmf.12 : https://logstash.wikimedia.org/app/kibana?#/dashboard/Fatal-Monitor?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:level,negate:!t,value:NOTICE),query:(match:(level:(query:NOTICE)))),('$state':(store:appState),meta:(alias:!n,disa [23:44:18] it's MobileFrontend [23:44:22] err wow, that's worse than i thought [23:44:34] (the url, not the problem :P) [23:44:52] Is it worth rolling back to .11 for mw.org etc? [23:44:54] /group0 [23:45:45] I'm filing a bug at least [23:45:55] Reedy, Krinkle: matt_flaschen was looking at that one earlier. MobileFrontend bug. See https://gerrit.wikimedia.org/r/#/c/301310/. 
[23:46:27] oh, bug already exists [23:47:08] Yeah, T141386 [23:47:08] T141386: onResourceLoaderGetConfigVars can not depend on user-specific info for wikidata settings - https://phabricator.wikimedia.org/T141386 [23:47:56] * Reedy cherry picks to .12 [23:48:42] No immediate need to merge to master [23:48:55] ebernhardson: FWIW there is a link to the individual logstash record when you open the details, e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2016.07.26/mediawiki?id=AVYpdym414thRtYyfGcl [23:49:25] it will still contain all the dashboard config crap but that can be cut off [23:49:36] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:51:12] tgr: yea i had linked that before, i was trying to link the fatalmonitor dashboard but with a custom filter so it showed all the wikis/deploy versions that had this specific error [23:51:22] Reedy, you're deploying it now? [23:51:35] I'm gonna when jenkins merges it [23:51:41] Okay, thanks. [23:51:49] tgr: turns out though there is a 'share' button which initially has the ugly url, but then clicking the button that kinda looks like -><- generates a short (er) url [23:59:04] !log reedy@tin Synchronized php-1.28.0-wmf.12/extensions/MobileFrontend/: Deploy revert for group0 for T141386 (duration: 00m 30s) [23:59:05] T141386: onResourceLoaderGetConfigVars can not depend on user-specific info for wikidata settings - https://phabricator.wikimedia.org/T141386 [23:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:18] Krinkle: ^^
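(The closing sequence above, cherry-pick the MobileFrontend revert onto the deployed wmf.12 branch, let jenkins merge it, pull it onto the deployment host, then sync only that extension directory, is the usual shape of an emergency group0 fix. Roughly, with the branch-update step simplified and the sync invocation hedged, since it varies with the scap version of the era:)

```bash
# Rough shape of the revert deployment logged above; illustrative only,
# not copied from the actual session. On the deployment host, after the
# cherry-pick has merged in gerrit:
cd /srv/mediawiki-staging/php-1.28.0-wmf.12/extensions/MobileFrontend
git fetch && git checkout origin/wmf/1.28.0-wmf.12

# Then push just that directory to the cluster (sync-dir, or
# "scap sync-dir", depending on the scap version in use):
cd /srv/mediawiki-staging
sync-dir php-1.28.0-wmf.12/extensions/MobileFrontend/ \
    'Deploy revert for group0 for T141386'
```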