[00:00:39] (03CR) 10Paladox: "Applied at http://gerrit-test.wmflabs.org/gerrit/#/c/32/2/tests/fixtures/layout-cloner.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:06:23] !log restbase deploy ae5fbac to staging
[00:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:11:48] (03PS2) 10Paladox: gerrit: Fix the css for inline diff [puppet] - 10https://gerrit.wikimedia.org/r/301027
[00:23:21] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, and 3 others: Unable to delete file pages on commons: MWException/LocalFileLockError: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2494394 (10aaron) 05Open>03Resolved
[00:26:11] (03CR) 10Dzahn: [C: 032] "yes thanks! i noticed that wrapping issue too. that should be https://bugs.chromium.org/p/gerrit/issues/detail?id=4292" [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:26:41] mutante: We still gotta merge my rsync thingie.
[00:26:51] If we want that to apply too :)
[00:28:05] :)
[00:28:16] (03CR) 10Paladox: "thanks." [puppet] - 10https://gerrit.wikimedia.org/r/301027 (owner: 10Paladox)
[00:28:29] (03PS3) 10Dzahn: Rsyncd: Allow ensure => absent on config files [puppet] - 10https://gerrit.wikimedia.org/r/300935 (owner: 10Chad)
[00:28:45] ostriches: yes, i meant to do that earlier, just got back later
[00:28:51] k :)
[00:29:05] (03CR) 10Dzahn: [C: 032] Rsyncd: Allow ensure => absent on config files [puppet] - 10https://gerrit.wikimedia.org/r/300935 (owner: 10Chad)
[00:30:43] ok, expecting a recovery and line wraps in a moment
[00:30:49] thank you
[00:30:50] :)
[00:31:35] it just removed the ferm rule for rsync
[00:31:47] here is the example link we had
[00:31:49] } elseif ( $action['type'] == 'unknown-signed-addition' ) {
[00:31:55] eh, wrong paste
[00:32:02] https://gerrit.wikimedia.org/r/#/c/301033/2/tests/fixtures/layout-cloner.yaml
[00:32:14] this shows how it wraps now instead of cutting the line off
[00:32:35] :)
[00:32:56] (03PS1) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:33:32] paladox: thanks
[00:33:42] you're welcome :)
[00:34:03] (03CR) 10jenkins-bot: [V: 04-1] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (owner: 10Yuvipanda)
[00:34:21] (03PS2) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:36:00] (03CR) 10jenkins-bot: [V: 04-1] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (owner: 10Yuvipanda)
[00:37:06] urandom: do you want 2008-c enabled now
[00:37:15] i might be too late
[00:37:33] (03PS4) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[00:38:03] mutante: uh, yeah, if you want
[00:38:31] (03PS2) 10Dzahn: Enable Cassandra instance restbase2008-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300942 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans)
[00:38:52] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase2008-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/300942 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans)
[00:39:20] (03PS3) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036
[00:40:29] urandom: you can go ahead now
[00:40:40] mutante: great, thanks!
[00:41:00] yw
[00:44:43] PROBLEM - MariaDB Slave Lag: s2 on db1036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1125.73 seconds
[00:45:00] PROBLEM - MariaDB Slave Lag: s2 on db2035 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 808.86 seconds
[00:45:15] (03CR) 10Dzahn: "you can just allow "/bin/journalctl *" like we do in other admin groups. the wildcard in the middle of it is not going to mean it's limite" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey)
[00:45:17] (03PS5) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286)
[00:48:58] !log T134016: Bootstrapping restbase2008-c.codfw.wmnet
[00:49:00] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016
[00:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:49:09] RECOVERY - MariaDB Slave Lag: s2 on db2035 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[00:49:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 605 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5090989 keys - replication_delay is 605
[00:52:54] RECOVERY - MariaDB Slave Lag: s2 on db1036 is OK: OK slave_sql_lag Replication lag: 0.31 seconds
[00:53:37] !log lead - stopped rsyncd
[00:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:55:29] bblack: So in regards to https://gerrit.wikimedia.org/r/#/c/296634/ - I'm not really sure what the procedures are for getting changes to varnish deployed. Should I sign that patch up for puppet swat? or do something else?
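The [00:45:15] review comment above is about sudo command matching for admin groups: a trailing wildcard already grants every journalctl invocation, and a wildcard placed mid-command does not narrow it, because sudoers wildcards also match whitespace. A minimal sudoers sketch of that point, using the eventbus-admins group name from change 300860; the real rules live in the puppet admin module and are not shown in the log:

    # trailing wildcard: allows any journalctl invocation
    %eventbus-admins ALL = NOPASSWD: /bin/journalctl *
    # mid-command wildcard: still matches arbitrary trailing arguments
    # (sudoers '*' matches spaces too), so it is not actually narrower
    %eventbus-admins ALL = NOPASSWD: /bin/journalctl -u eventlogging*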
[00:58:36] ACKNOWLEDGEMENT - swift-account-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-reaper on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-replicator on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-account-server on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server daniel_zahn known maintenance
[00:58:36] ACKNOWLEDGEMENT - swift-container-auditor on ms-be3001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor daniel_zahn known maintenance
[00:58:44] oops, not intended to kill it
[00:58:55] but will be back
[01:00:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5087128 keys - replication_delay is 0
[01:03:24] ACKNOWLEDGEMENT - Check size of conntrack table on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - DPKG on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - Disk space on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - MD RAID on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - NTP on ganeti1004 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - configured eth on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:24] ACKNOWLEDGEMENT - dhclient process on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:25] ACKNOWLEDGEMENT - ganeti-confd running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:25] ACKNOWLEDGEMENT - ganeti-mond running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:26] ACKNOWLEDGEMENT - ganeti-noded running on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:26] ACKNOWLEDGEMENT - puppet last run on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:03:27] ACKNOWLEDGEMENT - salt-minion processes on ganeti1004 is CRITICAL: Connection refused by host daniel_zahn https://phabricator.wikimedia.org/T138414
[01:04:22] 06Operations: eqiad: Install SSD's into ganeti hosts - https://phabricator.wikimedia.org/T138414#2399490 (10Dzahn) ganeti1004 showed up in Icinga. expired downtime. ACKed
[01:05:21] (03PS1) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:06:24] (03PS2) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:07:28] (03CR) 10Chad: [C: 031] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:08:10] PROBLEM - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: Connection refused
[01:10:25] (03PS3) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:10:51] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: Connection refused daniel_zahn bootstrapping T134016
[01:11:46] !log deploying from 2d9817b to a291da1 for ores in scb nodes
[01:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:14:05] deployed in canary
[01:14:08] it was okay
[01:21:03] okay, everything was fine and it's still fine
[01:21:07] I call it a victory
[01:21:47] i heard the new thing is that there is swift in labs
[01:21:52] oops, wrong channel
[01:22:01] Amir1: :)
[01:22:21] :)
[01:23:39] hey mutante, we have this new dashboard on the jobs the ORES extension makes to the service, and how many of them fail: https://grafana.wikimedia.org/dashboard/db/ores-extension I thought you might find this interesting :)
[01:24:20] (zoom out, you will see funny things :D)
[01:28:48] Amir1: that doesn't look so bad. if i zoom out far enough just a spike in the beginning
[01:28:58] 4% failure rate recently ?
[01:29:27] it's 1% since the last week and deployment of a new config for wikidata
[01:29:40] (03PS4) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:29:52] Amir1: :)
[01:29:53] the first three spikes were when the model had issues with Dutch Wikipedia
[01:30:01] 400 errors per minute
[01:30:53] (03PS5) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:31:22] the funny thing is that the ores extension retries failed jobs 30 times, and in some cases ores never gives a score (e.g. edits on talk pages in wikidata). I was thinking ORES is an AI, so it'll get upset and throw some scores after the 20th time :D
[01:33:33] (03CR) 10Alex Monk: [C: 031] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:33:55] hehe, gotta make it learn about wikidata talk pages
[01:36:29] PROBLEM - restbase endpoints health on cerium is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.147, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:37:00] PROBLEM - Restbase root url on cerium is CRITICAL: Connection refused
[01:38:59] ^ ignoring that because it's called a test host
[01:39:03] shrug
[01:39:39] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: Connection refused
[01:40:28] i think a test host should probably not be in monitoring
[01:41:23] (03PS4) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063)
[01:41:29] (03PS5) 10Yuvipanda: Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063)
[01:41:42] (03CR) 10Dereckson: [C: 031] "Looks good to me." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio)
[01:42:40] (03PS1) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045
[01:43:56] (03CR) 10Yuvipanda: [C: 032] Get rid of the LDAP+YAML ENC [puppet] - 10https://gerrit.wikimedia.org/r/301036 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:44:11] (03CR) 10Yuvipanda: "Killed with https://gerrit.wikimedia.org/r/#/c/301036/" [puppet] - 10https://gerrit.wikimedia.org/r/296809 (owner: 10Chad)
[01:44:59] (03CR) 10Yuvipanda: [C: 032] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:45:09] (03PS6) 10Yuvipanda: ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063)
[01:45:10] PROBLEM - cassandra-a service on cerium is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[01:45:14] (03CR) 10Yuvipanda: [V: 032] ldap: Kill a bunch of unused scripts [puppet] - 10https://gerrit.wikimedia.org/r/301040 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda)
[01:49:16] (03PS2) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449)
[01:49:37] (03CR) 10Dzahn: [C: 032] "the other 2 users from the same request are already done. same batch, just completing this" [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449) (owner: 10Dzahn)
[01:50:22] (03PS3) 10Dzahn: admin: add bcohn to analytics-privatedata, researchers [puppet] - 10https://gerrit.wikimedia.org/r/301045 (https://phabricator.wikimedia.org/T140449)
[01:50:32] sniped again
[01:50:49] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy
[01:51:20] RECOVERY - Restbase root url on cerium is OK: HTTP OK: HTTP/1.1 200 - 15273 bytes in 0.014 second response time
[01:51:20] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494534 (10Dzahn) a:03Dzahn
[01:51:37] !log cerium testing is over?
[01:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:53:37] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2494539 (10Dzahn) a:03Dzahn
[01:53:39] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494540 (10Dzahn) a:05Jgreen>03Dzahn
[01:53:41] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2494542 (10Dzahn) a:05Jgreen>03Dzahn
[01:55:30] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 4 failures
[01:55:40] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494547 (10Dzahn)
[01:55:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494545 (10Dzahn) 05Open>03Resolved user has been created on bastions, stat1002 (and elsewhere wher...
[01:55:50] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2494548 (10Dzahn)
[01:56:50] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10Dzahn)
[01:56:52] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2494549 (10Dzahn) 05Open>03Resolved user has been created on bastion hosts, stat1002 and other places where the group is u...
[02:00:45] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494555 (10Dzahn)
[02:00:47] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2494553 (10Dzahn) 05Open>03Resolved user has been created on bastion hosts, stat1002 and other places where t...
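The access-request tasks above are resolved by adding the users to puppet-managed POSIX groups; a quick way to verify the result after a puppet run on the target host, using the username and group names from T140449 (the exact output is elided):

    ssh stat1002.eqiad.wmnet id bcohn
    # expect analytics-privatedata-users and researchers in the groups list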
[02:02:06] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2441876 (10Dzahn) 05Open>03Resolved all 3 users have been created now. details in subtasks
[02:02:31] (03PS1) 10Yuvipanda: ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046
[02:03:49] RECOVERY - cassandra-a service on cerium is OK: OK - cassandra-a is active
[02:04:00] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:04:20] RECOVERY - cassandra-a CQL 10.64.16.153:9042 on cerium is OK: TCP OK - 0.005 second response time on port 9042
[02:05:51] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2494563 (10Dzahn) @ellery all 3 users have been created now. please follow-up with them so they get to the data they need ( --> T140399#2483375)
[02:10:10] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[02:12:10] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[02:17:27] (03PS1) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:19:04] (03CR) 10jenkins-bot: [V: 04-1] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (owner: 10Yuvipanda)
[02:19:09] (03PS2) 10Yuvipanda: ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046
[02:19:16] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Setup ldapvi + make a new role! [puppet] - 10https://gerrit.wikimedia.org/r/301046 (owner: 10Yuvipanda)
[02:21:02] (03PS1) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:23:44] jouncebot: next
[02:23:45] In 12 hour(s) and 36 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500)
[02:24:04] (03CR) 10jenkins-bot: [V: 04-1] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:24:20] (03PS2) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:24:22] (03PS2) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:24:40] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail
[02:25:35] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.11) (duration: 09m 00s)
[02:25:36] (03CR) 10jenkins-bot: [V: 04-1] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:25:58] (03CR) 10jenkins-bot: [V: 04-1] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (owner: 10Yuvipanda)
[02:26:15] (03PS3) 10Yuvipanda: ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049
[02:26:19] (03PS3) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:27:07] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove conflicting ldapvi package [puppet] - 10https://gerrit.wikimedia.org/r/301049 (owner: 10Yuvipanda)
[02:30:50] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:31:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 26 02:31:43 UTC 2016 (duration 6m 8s)
[02:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:58] (03PS1) 10Yuvipanda: ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050
[02:36:35] (03PS4) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:42:11] (03PS5) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048
[02:42:13] (03PS2) 10Yuvipanda: ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050
[02:42:15] (03PS1) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052
[02:42:57] (03CR) 10Yuvipanda: [C: 032] ldap: Fixup for ldapvi.conf [puppet] - 10https://gerrit.wikimedia.org/r/301050 (owner: 10Yuvipanda)
[02:47:11] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail
[02:50:59] (03PS2) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063)
[02:51:02] (03PS6) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063)
[02:52:55] (03PS1) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063)
[03:10:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 22 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:15:20] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:16:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 17 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[03:26:49] (03PS1) 10Yuvipanda: WIP replacement of modify-ldap-groups [puppet] - 10https://gerrit.wikimedia.org/r/301058
[03:31:19] (03PS1) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063)
[03:39:25] PROBLEM - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 10.65.0.24
[03:41:15] RECOVERY - Juniper alarms on asw-d-eqiad.mgmt.eqiad.wmnet is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
[03:43:14] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[03:47:14] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[03:57:37] (03PS1) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063)
[04:27:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:32:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[04:34:01] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[04:37:51] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[04:51:28] (03PS1) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[04:52:36] (03CR) 10jenkins-bot: [V: 04-1] Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[04:53:52] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[05:00:01] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[05:13:34] mmmm kafka2002 looks weird
[05:17:27] ah issues with kafka 0.9 after the upgrade
[05:19:37] very weird, will keep an eye on it
[05:21:32] (03PS2) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[05:30:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:36:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:45:29] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2494691 (10Dzahn)
[05:46:01] 06Operations, 13Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10Dzahn)
[05:46:03] 06Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2494692 (10Dzahn) 05Open>03Invalid oh, thanks @akosiaris for all the details
[05:47:50] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2494695 (10Dzahn) @dschwen Is it fixed now?
[05:54:06] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2494696 (10Dzahn) @andrew @yuvipanda as list admins of labs-l and labs-announce, how about the remaining checkbox " Add as mod to labs-l/labs-announce"...
[05:57:13] (03CR) 10Ori.livneh: [C: 031] Add Html5Depurate module and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[06:10:49] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:13:44] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: replace gerrit server (ytterbium) with jessie server (lead) - https://phabricator.wikimedia.org/T125018#2494697 (10Dzahn) decom: https://gerrit.wikimedia.org/r/#/c/300806/ https://gerrit.wikimedia.org/r/#/c/300812/ 01:36 ostriches:...
[06:16:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:18:08] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494702 (10Dzahn) Alright, found the shell name associated with the wikitech user is "jk" using ldapsearch. Added jk to the LDAP group called "nda". ldapsear...
[06:20:20] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494703 (10Dzahn) 05Open>03Resolved the grafana-admin part of this request should be resolved now. i am not sure about the WebRequestLogs , @gehel do you...
[06:20:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2494705 (10Dzahn) 05Resolved>03Open
[06:32:20] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:35:10] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:20] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[06:37:58] <_joe_> elukey: you're already on it right?
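The flapping check right above probes eventbus itself: it POSTs a test event to /v1/events and expects HTTP 201, so a 500 means the service accepted the request but failed to produce to Kafka. A rough manual version of that probe; the port number and the event body are assumptions, only the path and the 201/500 contract appear in the log:

    # prints the HTTP status code: 201 = accepted, 500 = produce failed
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Content-Type: application/json' \
        -d '[{"meta": {"topic": "test.event", "schema_uri": "test/event/1", "uri": "/test/event", "id": "00000000-0000-0000-0000-000000000001", "dt": "2016-07-26T06:37:00Z", "domain": "wikimedia.org"}}]' \
        http://kafka2002.codfw.wmnet:8085/v1/events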
[06:38:13] <_joe_> (eventbus)
[06:39:20] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[06:42:34] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2494722 (10Joe)
[06:42:36] hey _joe_ sorry I was commuting, will double check in a bit
[06:42:50] <_joe_> elukey: it recovered in the meantime
[06:42:51] I am sure it is due to the kafka 0.9 migration
[06:43:00] it seems to have been flapping since this morning :/
[06:51:01] so kafka seems ok
[06:51:23] but eventbus has some issues pushing messages to it
[06:51:41] those are only test events since this is main-codfw
[06:52:00] so it might be related to the python kafka client that doesn't like kafka 0.9
[06:52:18] ah yes: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[06:52:19] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2494729 (10Joe)
[06:56:30] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:12:28] (03PS3) 10Tim Starling: Add Html5Depurate module and role [puppet] - 10https://gerrit.wikimedia.org/r/301062
[07:12:48] (03PS1) 10Giuseppe Lavagetto: kafka_config: sort arrays as well [puppet] - 10https://gerrit.wikimedia.org/r/301070 (https://phabricator.wikimedia.org/T141242)
[07:12:50] (03PS1) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[07:14:19] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[07:16:08] (03CR) 10Tim Starling: Add Html5Depurate module and role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301062 (owner: 10Tim Starling)
[07:20:04] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2494769 (10MoritzMuehlenhoff) >>! In T140422#2493224, @madhuvishy wrote: > @MoritzMuehlenhoff Done! http://keys.gnupg.net/pks/lookup?op=get&search=0xA4D...
[07:38:46] !log Update cxserver to 447a6c9
[07:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:48:55] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10Jan_Dittrich)
[07:50:09] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[07:52:09] (03CR) 10Muehlenhoff: [C: 031] "Looks fine now" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[07:53:28] (03CR) 10Muehlenhoff: "But maybe drop the notes on labtest and the corp mirror which have been clarified on IRC." [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[07:56:09] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:02:07] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10elukey) Hi Jan, adding https://wikitech.wikimedia.org/wiki/Analytics/Data_access to this task as reference but you have probably already seen it. As far as I can...
[08:08:02] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 155574 MB (3% inode=99%)
[08:09:54] (03PS1) 10Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.5 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337)
[08:10:32] (03CR) 10Gilles: [C: 04-1] "python-thumbor-wikimedia 1.0.5 needs to be packaged and uploaded to our repo first" [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles)
[08:11:26] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2494904 (10Gilles)
[08:11:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2437365 (10Gilles)
[08:14:24] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2494922 (10Gilles)
[08:15:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:16:59] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2494926 (10MoritzMuehlenhoff) >>! In T141277#2493220, @chasemp wrote: > @andrew and I were discussing whether LVS would make sense in front of LDAP with the ability to more in...
[08:21:41] 06Operations, 06Labs, 10Labs-Infrastructure: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (10Joe) If what happened the other day was that one VM was overloaded and stopped answering ldap queries while still accepting connections (which given the described p...
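The failure mode @Joe describes in T141277 just above (an overloaded server that still accepts TCP connections but never answers) is exactly the case a plain port check misses: a useful liveness probe has to complete a real query under a short timeout. A sketch, with a placeholder host name rather than one of the real servers:

    # read the rootDSE with a 3-second network timeout and time limit;
    # a hung-but-accepting server fails this, while a bare TCP connect
    # check would still report it as healthy
    ldapsearch -x -H ldap://ldap-server.example.wmnet \
        -o nettimeout=3 -l 3 -b '' -s base '(objectClass=*)' namingContexts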
[08:21:42] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:22:01] RECOVERY - Disk space on fluorine is OK: DISK OK
[08:25:30] (03CR) 10Muehlenhoff: Minor tweaks to 2.12.2 package (031 comment) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad)
[08:38:14] (03CR) 10Hashar: remove ytterbium from puppet, update gerrit comment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300806 (owner: 10Dzahn)
[08:43:02] (03PS2) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[08:44:07] (03CR) 10Giuseppe Lavagetto: [C: 032] kafka_config: sort arrays as well [puppet] - 10https://gerrit.wikimedia.org/r/301070 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[08:44:24] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[08:48:52] (03PS2) 10Giuseppe Lavagetto: Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis)
[08:54:39] (03PS1) 10Jcrespo: Delete coredb_mysql module [puppet] - 10https://gerrit.wikimedia.org/r/301076
[08:54:42] 06Operations, 06Discovery, 06Labs, 10Labs-Infrastructure, and 2 others: Update coastline data in OSM postgres db (osmdb.eqiad.wmnet) - https://phabricator.wikimedia.org/T140296#2494999 (10dschwen) 05Open>03Resolved Yes! Many thanks!
[08:57:29] (03CR) 10Jcrespo: "This is technical debt; this module had been deprecated by the mariadb one, but some hosts continued using it until last week." [puppet] - 10https://gerrit.wikimedia.org/r/301076 (owner: 10Jcrespo)
[08:57:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Here you are modifying the default file we use via" [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis)
[09:04:12] (03PS3) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:05:55] (03PS4) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:08:33] (03PS5) 10Jcrespo: Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452)
[09:10:33] (03CR) 10Jcrespo: [C: 04-2] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo)
[09:10:43] (03CR) 10Jcrespo: [C: 032] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo)
[09:10:54] ^I clicked the wrong button
[09:11:11] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail
[09:11:32] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495057 (10fgiunchedi) @elukey sounds great, thanks! mw1170 and mw1171 would do I think?
[09:12:22] (03PS2) 10Filippo Giunchedi: rsyslog: temporarily lower centralserver retention [puppet] - 10https://gerrit.wikimedia.org/r/300833 (https://phabricator.wikimedia.org/T139612)
[09:18:39] !log updating debhelper, cdbs, devscripts, libintl-perl, libmodule-build-perl and libnet-dns-perl on jessie systems for compatibility with perl security update
[09:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:21:50] (03CR) 10Filippo Giunchedi: [C: 032] rsyslog: temporarily lower centralserver retention [puppet] - 10https://gerrit.wikimedia.org/r/300833 (https://phabricator.wikimedia.org/T139612) (owner: 10Filippo Giunchedi)
[09:29:41] (03PS3) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[09:30:18] (03PS1) 10Jcrespo: Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081
[09:31:04] (03PS2) 10Jcrespo: Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081
[09:32:20] (03CR) 10Jcrespo: [C: 032] Add provisional my.cnf for new labsdb replicas [puppet] - 10https://gerrit.wikimedia.org/r/301081 (owner: 10Jcrespo)
[09:33:21] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:39:04] (03PS1) 10Jcrespo: Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082
[09:39:15] (03PS2) 10Jcrespo: Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082
[09:39:42] RECOVERY - swift-container-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:39:42] RECOVERY - swift-account-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:39:52] RECOVERY - swift-container-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:39:52] RECOVERY - swift-object-server on ms-be3001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:40:04] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495090 (10Addshore) >>! In T141339#2494880, @elukey wrote: > Hi Jan, > > adding https://wikitech.wikimedia.org/wiki/Analytics/Data_access to this task as reference but you...
[09:40:13] RECOVERY - swift-container-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:40:23] RECOVERY - swift-object-updater on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:40:53] RECOVERY - swift-account-auditor on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:41:02] RECOVERY - swift-account-reaper on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:41:03] RECOVERY - swift-object-auditor on ms-be3001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:41:23] RECOVERY - swift-account-server on ms-be3001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:41:23] RECOVERY - swift-object-replicator on ms-be3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:41:26] hello ms-be3001
[09:42:08] (03PS1) 10Elukey: Add the permissions_validity_in_ms among the configurable parameters [puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869)
[09:42:16] :( sorry about that
[09:42:26] the recoveries will notify anyways
[09:43:03] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:43:22] RECOVERY - swift-container-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:43:22] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:43:31] RECOVERY - swift-account-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:43:31] RECOVERY - swift-container-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:43:42] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:43:42] RECOVERY - swift-account-reaper on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:43:42] RECOVERY - swift-object-auditor on ms-be3004 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:43:42] RECOVERY - swift-object-updater on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:43:43] RECOVERY - swift-container-updater on ms-be3004 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:43:43] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:43:52] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:43:52] RECOVERY - swift-account-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:44:09] RECOVERIES == good
[09:44:18] happy people
[09:44:21] RECOVERY - swift-account-server on ms-be3002 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:44:51] (03CR) 10Jcrespo: [C: 032] Fix template references for replica db config [puppet] - 10https://gerrit.wikimedia.org/r/301082 (owner: 10Jcrespo)
[09:45:03] RECOVERY - swift-object-replicator on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:45:03] RECOVERY - swift-account-server on ms-be3004 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:45:04] RECOVERY - swift-container-auditor on ms-be3002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:45:21] RECOVERY - swift-object-server on ms-be3002 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:46:12] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: puppet fail
[09:52:43] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2495105 (10Addshore) >>! In T140911#2494703, @Dzahn wrote: > the grafana-admin part of this request should be resolved now. i am not sure about the WebReques...
[09:54:57] (03CR) 10Muehlenhoff: "Actually after revisiting the Grafana dashboard, the current memory consumption rather depletes approx every 7-9 days (possibly be increas" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[09:56:54] (03PS1) 10Jcrespo: Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085
[09:57:39] (03CR) 10jenkins-bot: [V: 04-1] Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085 (owner: 10Jcrespo)
[09:58:27] (03PS2) 10Filippo Giunchedi: site: add prometheus::node_exporter to more machines [puppet] - 10https://gerrit.wikimedia.org/r/299970 (https://phabricator.wikimedia.org/T140646)
[10:00:26] (03PS2) 10Jcrespo: Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085
[10:01:38] (03CR) 10Jcrespo: [C: 032] Move replica templates to the top level [puppet] - 10https://gerrit.wikimedia.org/r/301085 (owner: 10Jcrespo)
[10:02:49] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As already stated, I would rather use something like a hourly cron script that restarts openldap when the used memory reaches N % of the t" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:03:35] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: puppet fail
[10:04:55] (03PS1) 10Jcrespo: More the template to the right subdir [puppet] - 10https://gerrit.wikimedia.org/r/301088
[10:05:11] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495149 (10elukey) @fgiunchedi looks good to me, the CPU utilization across the cluster seems very good and these hosts don't seem to have special properties...
[10:06:26] (03CR) 10Jcrespo: [C: 032] More the template to the right subdir [puppet] - 10https://gerrit.wikimedia.org/r/301088 (owner: 10Jcrespo)
[10:08:58] (03PS1) 10ArielGlenn: add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090
[10:11:26] (03CR) 10Alexandros Kosiaris: "Regarding the comment in the commit message about corp LDAP being hit by this, funnily enough, no, the corp replica is not hit by the issu" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:12:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The graph in https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen&from=1468702800000&to=1469307599999&var-server=" [puppet] - 10https://gerrit.wikimedia.org/r/300902 (https://phabricator.wikimedia.org/T130593) (owner: 10Dzahn)
[10:15:08] (03PS2) 10ArielGlenn: add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090
[10:16:23] (03PS1) 10Jcrespo: Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092
[10:16:26] <_joe_> win 25
[10:16:59] (03CR) 10ArielGlenn: [C: 032] add full paths to config files for pagetitles dump from cron [puppet] - 10https://gerrit.wikimedia.org/r/301090 (owner: 10ArielGlenn)
[10:18:05] (03PS2) 10Jcrespo: Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092
[10:19:32] (03CR) 10Jcrespo: [C: 032] Set labsdb replicas to have its datadir on /srv/sqldata [puppet] - 10https://gerrit.wikimedia.org/r/301092 (owner: 10Jcrespo)
[10:21:35] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:25:34] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495203 (10fgiunchedi) correction: I meant new hardware from one of the pools, i.e. any machine added in https://gerrit.wikimedia.org/r/#/c/290236/ so mw1291...
[10:38:18] (03PS1) 10Jcrespo: Update regex to include new labsdb and proxy machines [puppet] - 10https://gerrit.wikimedia.org/r/301095
[10:43:08] (03PS1) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282)
[10:43:31] !log restarting cassandra on aqs100[456] instances (not serving live traffic)
[10:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:43:41] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:45:22] (03PS4) 10Giuseppe Lavagetto: puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242)
[10:46:10] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:46:37] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: use LANG from /etc/default/locale, not C [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto)
[10:51:48] (03Draft1) 10Addshore: beta wgEchoMentionStatusNotifications default true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301098 (https://phabricator.wikimedia.org/T140234)
[10:52:08] (03CR) 10Addshore: [C: 04-1] "Not to be merged until the code / depends-on patch is merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301098 (https://phabricator.wikimedia.org/T140234) (owner: 10Addshore)
[10:53:50] (03CR) 10Filippo Giunchedi: [C: 031] Update regex to include new labsdb and proxy machines [puppet] - 10https://gerrit.wikimedia.org/r/301095 (owner: 10Jcrespo)
[10:54:24] (03PS1) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:55:10] (03PS2) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:58:55] (03PS3) 10Jcrespo: Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099
[10:59:31] (03CR) 10Jcrespo: [C: 032] Ignore trace filesystems on disk check [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/301099 (owner: 10Jcrespo)
[10:59:48] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495260 (10Joe) @fgiunchedi I would recommend, as long as this is just experimental, to repurpose mw1152 and then mw1153-60 which are the old imagescalers an...
[11:01:34] (03PS1) 10Jcrespo: Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100
[11:01:48] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495263 (10elukey) @Nuria any objection to this access request?
[11:02:03] (03PS2) 10Jcrespo: Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100
[11:03:32] (03CR) 10Jcrespo: [C: 032] Update mariadb module to merge new disk_check fix [puppet] - 10https://gerrit.wikimedia.org/r/301100 (owner: 10Jcrespo)
[11:06:03] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495265 (10Joe) To be more clear: hardware for the new appservers is not as much in overabundance as we had before; I would suggest not to repurpose machines...
[11:08:44] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495267 (10mark) >>! In T139606#2495265, @Joe wrote: > To be more clear: hardware for the new appservers is not as much in overabundance as we had before; I...
[11:18:37] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:18:37] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail
[11:20:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:21:57] (03PS1) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943)
[11:22:01] (03PS5) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606)
[11:22:14] (03CR) 10Addshore: [C: 04-1] "Not yet scheduled" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore)
[11:26:05] (03PS1) 10Filippo Giunchedi: claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606)
[11:26:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 19 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:28:42] !log filippo@palladium conftool action : set/pooled=no; selector: name=mw1291.*
[11:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:29:34] !log filippo@palladium conftool action : set/pooled=no; selector: name=mw1292.*
[11:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:30:47] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:31:39] <_joe_> godog: use pooled=inactive when you decide to depool it completely
[11:32:59] _joe_: ok thanks, what's the difference?
[11:35:01] <_joe_> godog: inactive makes pybal remove the server from its config altogether
[11:39:20] !log filippo@palladium conftool action : set/pooled=inactive; selector: name=mw1292.eqiad.wmnet
[11:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:39:27] !log filippo@palladium conftool action : set/pooled=inactive; selector: name=mw1291.eqiad.wmnet
[11:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:42:22] (03PS2) 10Gilles: Update Thumbor configuration for python-thumbor-wikimedia 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337)
[11:43:01] (03CR) 10Gilles: [C: 04-1] "python-thumbor-wikimedia 1.0.6 needs to be packaged and uploaded to our repo first" [puppet] - 10https://gerrit.wikimedia.org/r/301073 (https://phabricator.wikimedia.org/T141337) (owner: 10Gilles)
[11:43:09] _joe_: bit of a weird name then?
[11:43:18] inactive seems to suggest 'present but idling'
[11:47:00] (03PS1) 10Filippo Giunchedi: reclaim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606)
[11:47:27] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:48:13] !log installing exim4 updates related to perl security release
[11:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:58:56] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)
[12:00:22] hello event bus
[12:00:57] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2002 is OK: All endpoints are healthy
[12:01:04] will schedule some downtime until ottomata is online. These are only test events and there might be something weird communicating with kafka 0.9
[12:01:07] (upgraded yesterday)
[12:04:07] ah maybe it is due to the new confluent kafka client pkg
[12:04:10] not sure
[12:15:46] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Puppet has 2 failures
[12:16:36] PROBLEM - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied
[12:17:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[12:17:59] 06Operations: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2495341 (10MoritzMuehlenhoff)
[12:18:23] moritzm: the icinga failure on mx2001 might be related to the upgrades?
[12:20:00] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2495354 (10fgiunchedi) >>! In T139606#2495265, @Joe wrote: > I would suggest we pick 2 servers from the scalers pool, so mw1291 and 1292 - at the moment that...
[12:22:31] having a look
[12:23:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[12:26:39] thanks!
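For reference, the pooled states discussed between 11:31 and 11:43 as confctl invocations; the exact CLI spelling is reconstructed from the !log lines above rather than quoted from a terminal:

    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=yes       # pooled and serving
    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=no        # depooled, but pybal keeps the backend in its config
    confctl select 'name=mw1291.eqiad.wmnet' set/pooled=inactive  # removed from pybal's config altogether
    confctl select 'name=mw1291.eqiad.wmnet' get                  # inspect the current value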
[12:26:47] RECOVERY - Disk space on mx2001 is OK: DISK OK [12:27:55] the exim4 update also introduced changes scheduled for the next jessie point release, so this needed a manual puppet run to reinstate our permissions for /var/spool/exim4/scan [12:31:35] 06Operations, 10hardware-requests: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2495365 (10Peachey88) [12:40:37] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:42:10] !log installing perl security updates [12:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:49:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 18 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [12:56:27] (03PS4) 10Sbisson: Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) [13:07:02] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2495413 (10BBlack) @Nuria - Thanks, sounds awesome :) [13:07:57] (03PS4) 10BBlack: Add Content-Security-Policy to images from test[2]wiki [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [13:08:22] (03CR) 10BBlack: [C: 032 V: 032] Add Content-Security-Policy to images from test[2]wiki [puppet] - 10https://gerrit.wikimedia.org/r/296634 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [13:27:51] (03PS3) 10Elukey: Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) [13:28:06] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [13:28:47] (03CR) 10Elukey: "Reworked following Daniel's suggestion and the ones that came from the ops meeting. Kafka permissions will be added only if needed on a la" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [13:32:01] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495472 (10Joe) [13:33:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:34:36] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
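A sketch of the manual remediation described above, assuming the spool permissions are puppetized; the verification command is illustrative:

    # Re-apply the puppetized ownership/mode that the exim4 point-release update clobbered:
    sudo puppet agent --test
    # Confirm the Icinga disk check can traverse the directory again:
    sudo ls -ld /var/spool/exim4/scan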
[13:34:39] (03PS1) 10Jcrespo: Set es2001 check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 [13:39:58] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 237 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:40:40] (03PS2) 10Jcrespo: Set es2001 disk check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 [13:41:45] 06Operations, 10ops-eqiad: re-label mirror1001 to sodium - https://phabricator.wikimedia.org/T141105#2495493 (10Cmjohnson) 05Open>03Resolved [13:43:36] (03CR) 10Jcrespo: [C: 032] Set es2001 disk check options to warning: 5%, critical: 1%, no page [puppet] - 10https://gerrit.wikimedia.org/r/301117 (owner: 10Jcrespo) [13:46:16] RECOVERY - Disk space on es2001 is OK: DISK OK [13:46:30] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2480587 (10AlexMonk-WMF) >>! In T140911#2494702, @Dzahn wrote: > @Jonas You should now be able to login at grafana-admin using your wikitech credentials. It a... [13:46:49] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495542 (10Joe) I checked all active hosts, and all the differences I found are due basically due to bugs that have been fixed (like dep... [13:47:02] 06Operations, 07Puppet, 13Patch-For-Review, 05Puppet-infrastructure-modernization: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#2495545 (10Joe) [13:47:05] 06Operations, 07Puppet, 05Puppet-infrastructure-modernization: Make the WMF puppet tree compile equally under puppet 3.4 and 3.8 - https://phabricator.wikimedia.org/T141242#2495543 (10Joe) 05Open>03Resolved [13:50:27] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2494814 (10AlexMonk-WMF) 'statistics-users' seems redundant if you've got 'researchers' [13:52:24] !next [13:52:31] jouncebot next [13:52:32] In 1 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500) [13:54:01] (03PS5) 10MarcoAurelio: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) [13:54:47] (03PS7) 10MarcoAurelio: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) [13:55:01] !log compressing 300GB table on dbstore2002 (expect warnings, slowdown, lag -but it is a passive analytics slave) [13:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:55:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] Bacula: Remove old gerrit backup path, unused now (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [13:55:20] (03PS4) 10MarcoAurelio: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 [14:06:25] 06Operations, 10ops-eqiad: plug frqueue1001 into pfw1- ge-2/0/11 - https://phabricator.wikimedia.org/T141361#2495599 (10Jgreen) [14:07:55] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - 
https://phabricator.wikimedia.org/T141339#2495617 (10elukey) >>! In T141339#2495557, @AlexMonk-WMF wrote: > 'statistics-users' seems redundant if you've got 'researchers' I am not super familiar with the exact diff... [14:10:34] (03PS2) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) [14:12:12] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2495639 (10Jgreen) [14:13:25] (03PS3) 10ArielGlenn: move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) [14:14:34] (03CR) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [14:16:29] (03CR) 10ArielGlenn: [C: 032] move addschanges dumps to snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301096 (https://phabricator.wikimedia.org/T141282) (owner: 10ArielGlenn) [14:27:42] (03PS1) 10ArielGlenn: fix name of config file for addschanges dump on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301124 [14:29:57] (03PS2) 10Filippo Giunchedi: claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606) [14:30:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] claim mw129[12] for thumbor [dns] - 10https://gerrit.wikimedia.org/r/301106 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:32:30] !log reimage mw1291 as thumbor1001 [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:16] (03CR) 10ArielGlenn: [C: 032] fix name of config file for addschanges dump on snapshot1007 [puppet] - 10https://gerrit.wikimedia.org/r/301124 (owner: 10ArielGlenn) [14:35:08] (03PS6) 10Filippo Giunchedi: puppetization for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/300827 (https://phabricator.wikimedia.org/T139606) [14:36:36] (03PS2) 10Filippo Giunchedi: claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) [14:36:50] !log uploading openjdk-8 security update (8u102-b14-1~bpo8+1) to carbon [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:53] (03CR) 10Filippo Giunchedi: [C: 032] claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:38:02] (03PS3) 10Filippo Giunchedi: claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) [14:38:07] (03CR) 10Filippo Giunchedi: [V: 032] claim mw129[12] for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/301108 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [14:39:16] (03PS1) 10ArielGlenn: fix up template location for addschanges config file [puppet] - 10https://gerrit.wikimedia.org/r/301125 [14:41:31] (03PS2) 10ArielGlenn: fix up template location for addschanges config file [puppet] - 10https://gerrit.wikimedia.org/r/301125 [14:42:43] (03PS1) 10Ottomata: Hieraize eventlogging_kafka_handler to allow selection of different kafka clients [puppet] - 10https://gerrit.wikimedia.org/r/301126 [14:42:48] (03CR) 10ArielGlenn: [C: 032] fix up template location for addschanges config file [puppet] - 
10https://gerrit.wikimedia.org/r/301125 (owner: 10ArielGlenn) [14:44:15] (03PS2) 10Thcipriani: Use hiera for udp2log-mw logrotate count [puppet] - 10https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) [14:45:26] (03PS1) 10ArielGlenn: remove dump related classes from snapshot1001 through 1004 [puppet] - 10https://gerrit.wikimedia.org/r/301127 [14:46:39] (03CR) 10ArielGlenn: [C: 032] remove dump related classes from snapshot1001 through 1004 [puppet] - 10https://gerrit.wikimedia.org/r/301127 (owner: 10ArielGlenn) [14:48:36] (03PS1) 10Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 [14:48:38] (03PS1) 10Jforrester: Change default gallery mode to 'packed' on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) [14:48:47] (03CR) 10Thcipriani: "Puppet compiler output: https://puppet-compiler.wmflabs.org/3475/" [puppet] - 10https://gerrit.wikimedia.org/r/299672 (https://phabricator.wikimedia.org/T140313) (owner: 10Thcipriani) [14:49:06] (03CR) 10jenkins-bot: [V: 04-1] Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 (owner: 10Jforrester) [14:49:14] (03CR) 10jenkins-bot: [V: 04-1] Change default gallery mode to 'packed' on the English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [14:50:25] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2495768 (10Jseddon) @akosiaris Thank you so much for that link. To be honest you really don't have to try and recover the entire pad. That serv... [14:50:43] 06Operations, 10Wikimedia-Etherpad: Unable to access Etherpad - https://etherpad.wikimedia.org/p/Fundraising_Staff_Feedback - https://phabricator.wikimedia.org/T140886#2495770 (10Jseddon) p:05High>03Lowest [14:51:36] (03PS2) 10Jforrester: Test setting gallery config differently on Beta Cluster enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301128 [14:55:18] (03PS1) 10ArielGlenn: remove snapshot1001 through 1004 from hiera, mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/301131 [14:58:57] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services, 13Patch-For-Review: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2495785 (10GWicke) Based on the limited data so far, read latency seems pretty much unaffected on either host: {F43... [14:59:16] PROBLEM - Host lutetium is DOWN: PING CRITICAL - Packet loss = 100% [14:59:43] is that the false positive? [14:59:49] it's not exactly a false positive [14:59:52] but yeah, looking into it [14:59:53] because network is dumb? [14:59:59] swat time! [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1500). Please do the needful. [15:00:04] stephanebisson, mafk, and Addshore: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[15:00:12] !log installed cr2-eqiad FPC 3 [15:00:16] * mafk present [15:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:23] hello [15:00:44] Who's swatting? [15:00:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2495788 (10Dzahn) AlexMonk is right, i said that because of this line " 31 # Require ldap-group cn=nda,ou=groups,dc=wikimedia,dc=org " but did not not... [15:00:55] tyler as usual? :) [15:01:09] I can SWAT today [15:01:49] (03CR) 10Thcipriani: [C: 032] Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) (owner: 10Sbisson) [15:02:15] (03Merged) 10jenkins-bot: Remove EchoBundleEmailInterval [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289395 (https://phabricator.wikimedia.org/T135446) (owner: 10Sbisson) [15:02:17] * addshore is here :) [15:02:54] (03CR) 10ArielGlenn: [C: 032] remove snapshot1001 through 1004 from hiera, mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/301131 (owner: 10ArielGlenn) [15:04:08] (03PS1) 10ArielGlenn: remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 [15:05:10] (03Abandoned) 10BryanDavis: Add LANG to /etc/defaults/puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [15:05:25] RECOVERY - Host lutetium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [15:06:13] stephanebisson: change is live on mw1099, check with X-Wikimedia-Debug if applicable please [15:06:44] it should be a no-op but I'll test related functionality quickly [15:06:51] ack, thanks [15:07:51] (03CR) 10Elukey: "Looks good! I added some comments but I am not sure if they are relevant.. 
The zookeeper_url was the only puzzling part since afaiu it sho" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/300879 (https://phabricator.wikimedia.org/T134184) (owner: 10Ottomata) [15:08:35] PROBLEM - ElasticSearch health check for shards on elastic1032 is CRITICAL: CRITICAL - elasticsearch inactive shards 1381 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1303, number_of_pending_tasks: 1533, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 231576, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:35] PROBLEM - ElasticSearch health check for shards on elastic1036 is CRITICAL: CRITICAL - elasticsearch inactive shards 1381 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1303, number_of_pending_tasks: 1534, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 231656, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1025 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234524, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1034 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234573, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:36] PROBLEM - ElasticSearch health check for shards on elastic1035 is CRITICAL: CRITICAL - elasticsearch inactive shards 1378 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1300, number_of_pending_tasks: 1546, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 234488, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:42] (03PS6) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [15:08:46] PROBLEM - ElasticSearch health check for shards on elastic1043 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244440, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:46] PROBLEM - ElasticSearch health check for shards on elastic1019 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244519, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:46] PROBLEM - 
ElasticSearch health check for shards on elastic1026 is CRITICAL: CRITICAL - elasticsearch inactive shards 1361 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1283, number_of_pending_tasks: 1588, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 244582, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:47] PROBLEM - ElasticSearch health check for shards on elastic1018 is CRITICAL: CRITICAL - elasticsearch inactive shards 1357 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1279, number_of_pending_tasks: 1598, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 246874, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:08:56] PROBLEM - ElasticSearch health check for shards on elastic1030 is CRITICAL: CRITICAL - elasticsearch inactive shards 1349 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1271, number_of_pending_tasks: 1625, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 252274, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1038 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1669, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262536, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1047 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1670, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262600, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1020 is CRITICAL: CRITICAL - elasticsearch inactive shards 1342 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1264, number_of_pending_tasks: 1670, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 262793, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:06] looking ^ [15:09:06] PROBLEM - ElasticSearch health check for shards on elastic1033 is CRITICAL: CRITICAL - elasticsearch inactive shards 1341 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1263, number_of_pending_tasks: 1680, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 266031, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:07] PROBLEM - ElasticSearch health check for shards on elastic1037 is CRITICAL: CRITICAL - elasticsearch inactive shards 1341 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1263, number_of_pending_tasks: 1680, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 266280, cluster_name: 
production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:13] thcipriani: all good [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1023 is CRITICAL: CRITICAL - elasticsearch inactive shards 1340 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1262, number_of_pending_tasks: 1683, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 267184, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1045 is CRITICAL: CRITICAL - elasticsearch inactive shards 1338 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1260, number_of_pending_tasks: 1690, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 270412, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:15] PROBLEM - ElasticSearch health check for shards on elastic1028 is CRITICAL: CRITICAL - elasticsearch inactive shards 1338 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1260, number_of_pending_tasks: 1690, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 270432, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_perce [15:09:18] o.O [15:09:25] PROBLEM - ElasticSearch health check for shards on elastic1044 is CRITICAL: CRITICAL - elasticsearch inactive shards 1333 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1255, number_of_pending_tasks: 10, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1296, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_a [15:09:26] PROBLEM - ElasticSearch health check for shards on elastic1039 is CRITICAL: CRITICAL - elasticsearch inactive shards 1330 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1252, number_of_pending_tasks: 19, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4301, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_a [15:09:36] PROBLEM - ElasticSearch health check for shards on elastic1021 is CRITICAL: CRITICAL - elasticsearch inactive shards 1326 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1248, number_of_pending_tasks: 63, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 13253, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_ [15:09:39] <_joe_> gehel: any idea what happened? 
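When a storm like the above hits, a first diagnostic stop is the cluster health API on any node; a minimal sketch, assuming direct HTTP access on Elasticsearch's default port 9200 (hostname illustrative):

    # Cluster-wide status (green/yellow/red), unassigned shard count, pending tasks:
    curl -s 'http://elastic1030:9200/_cluster/health?pretty'
    # Shard recoveries currently in flight while the cluster re-replicates:
    curl -s 'http://elastic1030:9200/_cat/recovery?active_only=true&v'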
[15:09:45] PROBLEM - ElasticSearch health check for shards on elastic1022 is CRITICAL: CRITICAL - elasticsearch inactive shards 1321 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1243, number_of_pending_tasks: 105, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 20430, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:09:45] PROBLEM - ElasticSearch health check for shards on elastic1029 is CRITICAL: CRITICAL - elasticsearch inactive shards 1321 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1243, number_of_pending_tasks: 105, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 20532, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:09:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 32, dormant: 0, excluded: 1, unused: 0BRxe-3/3/1: down - BRxe-3/1/4: down - BRxe-3/0/1: down - BRxe-3/2/3: down - BRxe-3/3/3: down - BRxe-3/3/5: down - BRxe-3/0/3: down - BRxe-3/0/5: down - BRxe-3/2/1: down - BRxe-3/1/7: down - BRxe-3/3/6: down - BRxe-3/1/3: down - BRxe-3/1/2: down - BRxe-3/1/5: down - BRxe-3/1/0: down - BRxe [15:10:05] PROBLEM - ElasticSearch health check for shards on elastic1027 is CRITICAL: CRITICAL - elasticsearch inactive shards 1307 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1229, number_of_pending_tasks: 160, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 37384, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:05] PROBLEM - ElasticSearch health check for shards on elastic1041 is CRITICAL: CRITICAL - elasticsearch inactive shards 1307 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1229, number_of_pending_tasks: 160, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 37626, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] PROBLEM - ElasticSearch health check for shards on elastic1031 is CRITICAL: CRITICAL - elasticsearch inactive shards 1300 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1222, number_of_pending_tasks: 173, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 44521, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] PROBLEM - ElasticSearch health check for shards on elastic1046 is CRITICAL: CRITICAL - elasticsearch inactive shards 1300 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1222, number_of_pending_tasks: 173, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 44562, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:07] _joe_ he is out [15:10:16] <_joe_> oh, well [15:10:16] but there is a PROBLEM - Router interfaces on cr2-eqiad [15:10:16] PROBLEM - ElasticSearch health check for shards on elastic1040 is CRITICAL: CRITICAL - elasticsearch inactive shards 1291 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1213, number_of_pending_tasks: 190, number_of_in_flight_fetch: 0, 
timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 54045, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:25] PROBLEM - ElasticSearch health check for shards on elastic1042 is CRITICAL: CRITICAL - elasticsearch inactive shards 1284 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1206, number_of_pending_tasks: 204, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 60337, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:26] PROBLEM - ElasticSearch health check for shards on elastic1024 is CRITICAL: CRITICAL - elasticsearch inactive shards 1281 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1203, number_of_pending_tasks: 225, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 63933, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:26] PROBLEM - ElasticSearch health check for shards on elastic1017 is CRITICAL: CRITICAL - elasticsearch inactive shards 1281 threshold =0.1% breach: status: yellow, number_of_nodes: 31, unassigned_shards: 1203, number_of_pending_tasks: 225, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 64000, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent [15:10:30] _joe_: still unclear but network issue I suppose [15:10:44] (03PS7) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [15:11:03] no [15:11:06] no network issues [15:11:08] ostriches, hi would you be able to review https://gerrit.wikimedia.org/r/#/c/301001/ please. [15:11:30] some nodes stopped talking to each other [15:11:33] <_joe_> dcausse: let's switch to codfw and debug with ease? [15:11:58] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:289395|Remove EchoBundleEmailInterval (T135446)]] PART I (duration: 00m 34s) [15:11:59] thcipriani: don't know if you've seen my response with all those alerts, all good with the config patch [15:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:09] T135446: Unsubstituted message footer in mail notification of thanks - https://phabricator.wikimedia.org/T135446 [15:12:19] <_joe_> thcipriani: please stop swatting if you see things exploding in here [15:12:27] (03PS2) 10Chad: Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 [15:12:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:289395|Remove EchoBundleEmailInterval (T135446)]] PART II (duration: 00m 26s) [15:12:32] T135446: Unsubstituted message footer in mail notification of thanks - https://phabricator.wikimedia.org/T135446 [15:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:42] ^ stephanebisson deployed everywhere. [15:12:43] _joe_: don't know...
search is still working, let me have a quick look at how many shards were unassigned [15:12:45] _joe_: ack, stopping [15:12:52] <_joe_> ok [15:13:08] (03PS2) 10ArielGlenn: remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 [15:13:12] <_joe_> thcipriani: rationale is: a) we don't want to overlap effects b) we might need to do a very fast sync-file [15:14:16] _joe_: yup, makes sense. change had been pushed to test and pulled down to deployment master, wanted to make sure we were in a consistent state before pausing. [15:14:17] <_joe_> dcausse: have you tried restarting one of the segregated nodes? [15:14:28] <_joe_> thcipriani: yeah agreed! [15:15:59] <_joe_> I see the shards are recovering [15:16:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 1, unused: 0 [15:16:10] _joe_: they are all back [15:16:15] <_joe_> that's because the partition wasn't as bad as it could've been [15:16:16] recovering [15:16:25] (this alert was spurious and it's recovering because I just told it to ignore those interfaces, fwiw) [15:16:32] trying to understand why the master dropped 6 nodes suddenly [15:16:44] <_joe_> dcausse: what node is the master? [15:16:51] we had a network flap, but it must have lasted for less than 3-4 seconds [15:17:08] _joe_: elastic1030 [15:17:13] I can do it again if you want [15:17:15] <_joe_> paravoid: I guess all servers in a row were dropped? [15:17:16] just a master swap [15:17:19] <_joe_> paravoid: not now :) [15:17:27] <_joe_> btw let me check something [15:17:30] paravoid: that's not really a flap is it [15:17:57] <_joe_> restbase didn't crash [15:18:07] (03CR) 10ArielGlenn: [C: 032] remove obsolete manifests from snapshot module/role [puppet] - 10https://gerrit.wikimedia.org/r/301132 (owner: 10ArielGlenn) [15:18:32] it was really really short [15:18:44] nothing else even noticed really [15:18:48] paravoid: that might be the cause, nodes were removed and re-added in one sec, unfortunately it was sufficient to cause a recovery [15:19:13] paravoid: why would that cause unavailability though? [15:19:19] not even 500s noticed, which is one of our most sensitive checks [15:19:28] during a vrrp master switch the backup doesn't go away [15:19:39] <_joe_> dcausse: [2016-07-26 15:05:19,293][WARN ][cluster.routing.allocation.decider] [elastic1030] after allocating, node [7mpr8WTsQIC_6Gvpza0Emg] would have more than the allowed 20% free disk threshold (16.6% free), preventi [15:19:43] <_joe_> ng allocation [15:19:48] <_joe_> is this ok/normal? [15:20:01] (03PS1) 10Muehlenhoff: Remove old trusty scalers from conftool-data and dsh [puppet] - 10https://gerrit.wikimedia.org/r/301138 (https://phabricator.wikimedia.org/T141352) [15:20:06] _joe_: it's "normal" yes [15:21:12] I think it's ok to let the cluster recover [15:21:50] <_joe_> dcausse: I agree I was looking at a few logs [15:21:59] yeah, and then do this again a few times :) [15:21:59] <_joe_> and well, you're more expert than me with elasticsearch [15:22:18] <_joe_> I am sure there is some parameter to tune [15:23:56] dropping a node will always cause a decrease in the number of shards and cause icinga alerts, unfortunately the time to recover is not negligible... [15:23:59] Hello! Just saw the message... Can I do anything to help?
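For reference, the "parameter to tune" here plausibly covers two knobs: the delayed-allocation window, so a node that drops for a few seconds does not trigger a full shard recovery, and the free-disk watermark behind the allocation-decider warning quoted above. A hedged sketch; setting names per the Elasticsearch 1.x/2.x documentation, values purely illustrative:

    # Wait before reallocating shards that belonged to a node that just left:
    curl -XPUT 'http://localhost:9200/_all/_settings' -d '{"index.unassigned.node_left.delayed_timeout": "5m"}'
    # The free-disk threshold the decider enforces (80% used = the "20% free" in the warning):
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.disk.watermark.high": "80%"}}'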
[15:24:00] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2495881 (10faidon) We installed the new FPC on cr2-eqiad today — it's now up and online, all of its 32 10G ports. [15:24:01] _joe_: still need the '@'? ;) [15:24:10] gehel: hi! [15:24:20] <_joe_> Luke081515: nope it just remained attached from kicking the troll yesterday [15:24:44] ok :) [15:25:05] thcipriani: if you could please poke me after swat restarts it'd be appreciated, thank you! [15:25:05] <_joe_> not exactly my main priority (kicking trolls out of IRC chans) [15:25:18] mafk: yup, will do. [15:25:20] (03PS1) 10ArielGlenn: move dump related commonly included classes out to common role [puppet] - 10https://gerrit.wikimedia.org/r/301139 [15:25:20] Luke081515: does it matter so much? [15:25:32] thcipriani: same for me please :) [15:25:33] dcausse: do you have an approximate time of when the troubles started? [15:28:25] <_joe_> paravoid: it's in general good irc-netiquette not to keep the op tag when not needed - I agree with that [15:28:34] (03CR) 10ArielGlenn: [C: 032] move dump related commonly included classes out to common role [puppet] - 10https://gerrit.wikimedia.org/r/301139 (owner: 10ArielGlenn) [15:28:39] yeah sure, but whatever [15:28:43] <_joe_> I just didn't notice until Luke081515 told me [15:28:56] <_joe_> oh, yes, that was basically my preceding comment too [15:31:18] (03PS1) 10ArielGlenn: add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 [15:33:08] (03CR) 10BryanDavis: [C: 031] "If at some point we turn on the ferm firewalls in deployment-prep we won't want to restrict the Logstash ports to the deployment-prep proj" [puppet] - 10https://gerrit.wikimedia.org/r/297376 (owner: 10Muehlenhoff) [15:33:09] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2495639 (10Cmjohnson) I did a check on all ports and verified each one. pfw1 0 -> indium 1 -> payment1001 2 -> payment1003 3 -> pay-lvs1001 4 -> pay-lvs1001 eth2 (doesn’t appear to be acti... [15:33:14] lol +1 to whatever on ircops. 
[15:33:29] (03PS2) 10ArielGlenn: add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 [15:34:46] RECOVERY - ElasticSearch health check for shards on elastic1041 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 835, number_of_pending_tasks: 29, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4856, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0141720266, [15:34:46] RECOVERY - ElasticSearch health check for shards on elastic1043 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 835, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0141720266, acti [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1025 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 25, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 3687, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1034 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 26, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 3865, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:34:56] RECOVERY - ElasticSearch health check for shards on elastic1035 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 833, number_of_pending_tasks: 27, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4046, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0359751444, [15:35:06] RECOVERY - ElasticSearch health check for shards on elastic1018 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 832, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0468767034, acti [15:35:15] RECOVERY - ElasticSearch health check for shards on elastic1047 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, acti [15:35:15] RECOVERY - ElasticSearch health check for shards on elastic1028 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: 
False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, acti [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1029 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 273, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1020 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 306, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:16] RECOVERY - ElasticSearch health check for shards on elastic1044 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 830, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 462, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.0686798212, ac [15:35:27] RECOVERY - ElasticSearch health check for shards on elastic1019 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 827, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.101384498, activ [15:35:36] RECOVERY - ElasticSearch health check for shards on elastic1046 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 827, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.101384498, activ [15:35:41] the number of shards is cluster wide, maybe we don't need to have this check on all nodes... 
[15:35:55] RECOVERY - ElasticSearch health check for shards on elastic1022 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, acti [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1040 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1074, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1033 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 4, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1769, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:35:56] RECOVERY - ElasticSearch health check for shards on elastic1031 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 826, number_of_pending_tasks: 5, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 1813, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1122860569, a [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1045 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1030 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1021 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1231876158, acti [15:36:05] RECOVERY - ElasticSearch health check for shards on elastic1023 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 825, number_of_pending_tasks: 1, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, 
active_shards_percent_as_number: 90.1231876158, acti [15:36:06] RECOVERY - ElasticSearch health check for shards on elastic1037 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 23, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4864, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:07] RECOVERY - ElasticSearch health check for shards on elastic1039 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 22, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4781, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:07] RECOVERY - ElasticSearch health check for shards on elastic1026 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 823, number_of_pending_tasks: 22, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 4751, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1449907337, [15:36:15] RECOVERY - ElasticSearch health check for shards on elastic1042 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 821, number_of_pending_tasks: 38, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 10840, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1667938515, [15:36:16] RECOVERY - ElasticSearch health check for shards on elastic1038 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 818, number_of_pending_tasks: 50, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 14580, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.1994985283, [15:36:22] (03CR) 10Eevans: [C: 031] "Making a Cassandra tunable, tunable via Puppet, seems reasonable to me, (so +1)." 
[puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869) (owner: 10Elukey) [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1017 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 76, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 27008, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.2976125586, [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1024 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 809, number_of_pending_tasks: 76, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 27002, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.2976125586, [15:36:35] RECOVERY - ElasticSearch health check for shards on elastic1027 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 806, number_of_pending_tasks: 90, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 32023, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3303172354, [15:36:46] RECOVERY - ElasticSearch health check for shards on elastic1032 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3630219121, acti [15:36:46] RECOVERY - ElasticSearch health check for shards on elastic1036 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 803, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3034, task_max_waiting_in_queue_millis: 0, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards_percent_as_number: 90.3630219121, acti [15:37:23] (03CR) 10ArielGlenn: [C: 032] add system::role to the dump related roles for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301142 (owner: 10ArielGlenn) [15:37:56] dcausse: Maybe just run it on the master? [15:38:55] Er, master_eligible. That'd be 8 nodes :) [15:39:16] lots of recovery. Can SWAT continue? [15:39:44] dcausse: Could also help detect a split brain, if one master_eligible (but not the others) reported shards missing. [15:40:09] * ostriches shrugs [15:42:01] (03PS1) 10ArielGlenn: remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 [15:42:34] ostriches: indeed [15:43:16] * thcipriani continues SWAT [15:43:31] [= [15:43:52] mafk: if you're around, let's try to get yours out the door. 
[15:44:06] thcipriani: yep, I'm done with the other things [15:44:19] (03CR) 10Thcipriani: [C: 032] Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:44:33] * mafk enables x-wm-debug [15:44:37] dcausse: Plus, 8 failure/recoveries are nicer than like 30 :p [15:44:49] ircspam-- :p [15:45:33] (03PS6) 10Thcipriani: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:46:37] (03CR) 10Thcipriani: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:46:42] (03CR) 10Thcipriani: [C: 032] Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:47:07] !log reimage mw1292 as thumbor1002 [15:47:10] (03Merged) 10jenkins-bot: Configuration changes for mk.wiktionary.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300177 (https://phabricator.wikimedia.org/T140566) (owner: 10MarcoAurelio) [15:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:21] ostriches: well, at least I can't miss it :), I'll have a look at this check, err. I'll poke Guillaume about that :) [15:48:32] akosiaris: When you get a minute, I amended that bacula patch. Just trying to clean up all the old crud from ytterbium :) [15:48:47] dcausse: Sounds good. If you need a reviewer feel free to throw me on it. [15:48:53] thanks! [15:49:06] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: Connection refused [15:49:18] godog, so are we going to need a thumbor machine in deployment-prep? [15:49:42] (03PS1) 10Thcipriani: Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 [15:50:00] (03CR) 10Alexandros Kosiaris: [C: 032] Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:08] (03PS3) 10Alexandros Kosiaris: Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:09] mafk: https://gerrit.wikimedia.org/r/#/c/300177/ should be live on mw1099, check please [15:50:11] (03CR) 10Alexandros Kosiaris: [V: 032] Bacula: Remove old gerrit backup path, unused now [puppet] - 10https://gerrit.wikimedia.org/r/300905 (owner: 10Chad) [15:50:20] thcipriani: ack, checking [15:50:31] Krenair: mh I don't think so, I've been using deployment-imagescaler01 which was otherwise idle [15:50:39] ostriches: done :-) [15:50:43] Krenair: we could rename it tho, I'm not attached to the name [15:51:00] godog, I don't mind the name. I actually wasn't aware we had that machine [15:51:09] thcipriani: after changing a namespace name, shouldn't namespaceDupes be run? (ping Krenair) [15:51:10] ostriches: thanks for cleaning up btw [15:51:18] No problem! [15:51:22] Like I said in the comments, this is the exact same data we're backing up from lead now, so if there's copies of ytterbium's data, drop away :) [15:51:24] mafk, hmm... can you link me to the change?
https://gerrit.wikimedia.org/r/#/c/300177/ [15:51:35] I think the answer is yes [15:51:50] thcipriani: all looks ok on mw1099 [15:51:57] dcausse: we do expose all services through lvs, so we could check cluster wide stats there. [15:52:06] mafk: ack, rolling out everywhere. [15:52:10] I don't see the new favicon though, it might be a cache thing [15:52:12] dcausse: and not on a specific node [15:52:39] gehel: sounds like a good idea, I think it'll work [15:53:09] I don't like that idea, because if lvs is down (which is bad on its own), it causes cascading (and technically incorrect) failures as well. [15:53:10] mafk, this doesn't change namespaces? [15:53:20] oh I just hadn't scrolled down [15:53:22] I prefer to check directly from a node (which is why I suggested a master) [15:53:22] damn gerrit ui change [15:53:22] Krenair: renames Wiktionary to translated [15:53:27] lol [15:53:47] when we get used to it, we'll have to get used to Phabricator differential [15:54:00] Nah, we'll probably get the *new* gerrit UI first :p [15:54:02] checking master eligible nodes is nice also, if all are down search is down anyways [15:54:03] mafk: yup :P [15:54:07] (yes, I'm not joking :)) [15:54:07] :D [15:54:10] !log thcipriani@tin Synchronized static/favicon/wiktionary/mk.ico: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART I (duration: 00m 24s) [15:54:45] certainly when creating or deleting a namespace you'd run namespaceDupes [15:54:53] !log thcipriani@tin Synchronized static/images/project-logos/mkwiktionary.png: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART II (duration: 00m 24s) [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:58] I can't do that I think [15:54:59] I'm not sure it matters here. maybe run it without --fix to see [15:55:21] yup, will give it a shot. [15:55:41] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:300177|Configuration changes for mk.wiktionary.org]] PART III (duration: 00m 26s) [15:55:43] dcausse: we probably want a specific check for master eligible... [15:56:16] PROBLEM - puppet last run on mw2231 is CRITICAL: CRITICAL: puppet fail [15:56:16] yes, and maybe move the shard count check to this specific check [15:56:25] ostriches i thought that new ui is part of gerrit 3.0. [15:56:29] * gehel is going back to vacation. Will have a look later tonight... [15:56:46] yup: 632 links to fix [15:57:04] paladox: Lol, gerrit 3.0 is nowhere near fruition. Polygerrit is probably gonna land in another release or two. [15:57:11] Oh [15:57:17] It has already landed [15:57:26] in gerrit master waiting for release [15:57:42] mafk: check please [15:57:52] but it is not the default and will require either us or them to build the war with polygerrit [15:58:05] paladox: Yeah in master, but I dunno if it'll make it into 2.13. There's a bunch of outstanding bugs and missing stuff. [15:58:09] polygerrit is broken on Internet Explorer, but works on microsoft edge [15:58:12] dare I ask what polygerrit is? [15:58:18] apergos: New UI. [15:58:19] It is the new ui [15:58:22] For gerrit [15:58:27] thcipriani: ack, rechecking, has namespaceDupes run already?
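A sketch of the check placement being discussed: since shard counts are cluster-wide, one option is to list the master-eligible nodes and attach the Icinga shard check only to those; column semantics per the _cat/nodes API, hostname illustrative:

    # 'm' in the master column marks a master-eligible node, '*' the elected master:
    curl -s 'http://elastic1030:9200/_cat/nodes?v&h=name,master'
    # The shard-count service check would then target those ~8 hosts instead of all 31 nodes.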
[15:58:38] apergos: https://gerrit-review.googlesource.com/c/79298/?polygerrit=1 (compare with =0 if you need a reminder of what it looks like now) [15:58:40] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495974 (10AlexMonk-WMF) >>! In T141339#2495617, @elukey wrote: >>>! In T141339#2495557, @AlexMonk-WMF wrote: >> 'statistics-users' seems redundant if you've got 'researcher... [15:58:57] mafk: yup, just run [15:59:00] ostriches it is definitely a much better ui [15:59:06] my immediate reaction is that I prefer the poly [15:59:08] but still needs a lot of improvements [15:59:14] I find the one we have now too cluttered [15:59:19] Yep [15:59:29] thcipriani: hmm https://mk.wiktionary.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D1%80%D0%B5%D1%87%D0%BD%D0%B8%D0%BA:%D0%91%D0%BE%D1%82%D0%BE%D0%B2%D0%B8 [15:59:30] apergos: It's not feature complete unfortunately. Only in master and hidden behind config flags / url params. [15:59:33] (03PS2) 10Giuseppe Lavagetto: Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 (owner: 10Thcipriani) [15:59:44] I'm hoping they finish before 2.13, but we'll see :) [15:59:45] bah hermberg [16:00:00] I used the new 'edit files' feature just today [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1600). Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:14] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495976 (10elukey) I will follow up adding some clarity to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups [16:00:21] to edit the commit message. it probably would have been quicker to pop over to my editor and save/push but eh [16:00:28] just because I *could* [16:00:45] mafk: hmm, ran mwscript namespaceDupes.php mkwiktionary --fix, didn't seem to be any problems :\ [16:01:03] apergos: My favorite use-case tbh is actually "I see someone else's random patch that has a small typo" [16:01:04] (03CR) 10Giuseppe Lavagetto: [C: 032] Beta: Fix non-puppetmaster errors [puppet] - 10https://gerrit.wikimedia.org/r/301144 (owner: 10Thcipriani) [16:01:16] Rather than actually fixing your own patches (which you likely have sitting on a branch locally anyway) [16:01:29] mafk: just did a dry run again: 0 pages to fix, 0 were resolvable. [16:01:30] this was 'woops I'ma gonna merge this so clearly it's no longer WIP' [16:01:34] thcipriani: maybe it still needs a bit of time, anyway, all looks good (new favicon not showing though, maybe also needs a bit of time) [16:01:50] but yeah if it's someone else's patch that is the perfect [16:01:55] https://github.com/gerrit-review/gerrit/tree/master/polygerrit-ui [16:01:56] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.047 second response time [16:02:01] apergos ostriches ^^ [16:02:11] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495987 (10Addshore) >>! In T141339#2495976, @elukey wrote: > I will follow up adding some clarity to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups...
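The namespaceDupes workflow thcipriani runs above, spelled out as a sketch — the mkwiktionary wiki ID and the mwscript wrapper are taken straight from the log; running without --fix first is the dry run mentioned earlier:

    # without --fix the script only reports titles colliding with the renamed namespace
    mwscript namespaceDupes.php mkwiktionary
    # resolve whatever the dry run found
    mwscript namespaceDupes.php mkwiktionary --fix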
[16:02:15] https://mk.wiktionary.org/wiki/%D0%A1%D0%BF%D0%B5%D1%86%D0%B8%D1%98%D0%B0%D0%BB%D0%BD%D0%B0:%D0%9F%D1%80%D0%B8%D0%B4%D0%BE%D0%BD%D0%B5%D1%81%D0%B8/MarcoAurelio <-- all looks good here, the namespace names were changed correctly [16:02:19] mafk: ack. ok. mafk addshore : I'm going to bump the rest of the patches for SWAT so I can get out of the way for puppet SWAT. [16:02:31] thcipriani: okay! [16:02:32] still here :) [16:02:37] paladox: bookmarked to look at later [16:02:39] thankye [16:02:44] Ok and you're welcome :) [16:02:47] Yeah we'll have to install node on the gerrit server! I feel all kinds of weird about that :p [16:03:06] java and node together! Just like...no one intended? ;-) [16:03:07] next up phpMyAdmin :-P [16:03:10] Oh, maybe node won't be required. [16:03:21] ostriches node isn't required, you can either use go, or node [16:03:29] apergos: You say that like we've got php installed on the gerrit server :p [16:03:41] paladox: That's not what it says. [16:03:46] It says install node, and go is optional. [16:03:46] I believe in planning for the worst case :-P [16:03:52] Oh [16:04:24] Or you can do [16:04:25] buck build polygerrit && \ [16:04:25] java -jar buck-out/gen/polygerrit/polygerrit.war daemon --polygerrit-dev -d ../gerrit_testsite --console-log --show-stack- [16:04:28] ostriches ^^ [16:04:46] build it with buck, and set up a test instance where you can test it. [16:04:48] ? [16:04:50] (03CR) 10Alexandros Kosiaris: "That would be https://gerrit.wikimedia.org/r/#/c/301071/" [puppet] - 10https://gerrit.wikimedia.org/r/272613 (owner: 10BryanDavis) [16:05:14] Heh, I don't need to test it right now, far too immature to spend time on :) [16:05:18] * ostriches will just wait and watch instead [16:05:32] RECOVERY - puppet last run on mw2231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495993 (10AlexMonk-WMF) >>! In T141339#2495987, @Addshore wrote: >>>! In T141339#2495976, @elukey wrote: >> I will follow up adding some clarity to https://wikitech.wikimed... [16:06:50] (03CR) 10Alexandros Kosiaris: [C: 031] "@Faidon. it's the standard namespace issue we are facing with role::* as long as we import role/*. After getting rid of import role/* we w" [puppet] - 10https://gerrit.wikimedia.org/r/298911 (owner: 10Dzahn) [16:07:33] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2495994 (10Addshore) >>! In T141339#2495993, @AlexMonk-WMF wrote: > Yes. You can find this out in puppet (manifests/site.pp shows it has `role statistics::cruncher`, hierada... [16:08:22] (03CR) 10Faidon Liambotis: "Are these lint warnings like Daniel said or actual autoloader issues?
(if it's the former, let's ignore them for now?)" [puppet] - 10https://gerrit.wikimedia.org/r/298911 (owner: 10Dzahn) [16:08:57] (03CR) 10Alexandros Kosiaris: [C: 031] ipmi: move role to module structure [puppet] - 10https://gerrit.wikimedia.org/r/298902 (owner: 10Dzahn) [16:09:41] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:09:47] (03PS4) 10Alexandros Kosiaris: servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:09:55] ostriches im hoping they promote it to stable and remove the flag and just allow you to choose the default in the preference [16:10:24] im also hoping they drop the cookie that they use to get it working [16:11:03] is there somewhere in eqiad that I could stash ~900G of data? [16:11:29] * bd808 hands urandom a very large thumb drive [16:12:18] bd808: heh [16:12:50] (03CR) 10Alexandros Kosiaris: [V: 032] servermon: move role to module, add system::role [puppet] - 10https://gerrit.wikimedia.org/r/298904 (owner: 10Dzahn) [16:13:54] urandom: fluorine is the biggest pool of disk I know about in eqiad but it only has 500G free (of 3.8T) [16:14:33] thanks for the swat thcipriani [16:14:44] will schedule the other two for tomorrow if I can [16:15:13] mafk: ack thanks for the patches, sorry the window ran out :( [16:15:15] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003.eqiad.wmnet for WMDE-jand - https://phabricator.wikimedia.org/T141339#2496019 (10elukey) Confirmed with ottomata that 'researchers' is the only one needed. [16:15:22] urandom: yeah for graphite data on labmon it has been done with an external drive iirc [16:15:50] thcipriani: not your fault, blame that ElasticSearch thing :P [16:15:51] oh damn, the thumb drive wasn't a joke... [16:16:25] 06Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496020 (10ema) [16:16:40] 06Operations, 10Traffic: Age header reset to 0 after 24 hours on varnish frontends - https://phabricator.wikimedia.org/T141373#2496032 (10ema) p:05Triage>03Normal [16:17:00] urandom, bd808: If only we still had NFS! ;-) [16:17:19] poor old netapps [16:20:12] has tcy.wiki been created meanwhile? [16:21:04] RECOVERY - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is OK: TCP OK - 0.038 second response time on port 9042 [16:23:34] mutante: no, I don't think so - however tcy.wikipedia.org redirects to the test project in Incubator [16:23:51] i think i know where i could put it (at least temporarily), but it would involve transfering it to codfw; would it be a problem to rsync 900G of data from eqiad to codfw? [16:24:05] mafk: yes, i added it to DNS but wanted to know about the actual create_wiki script [16:24:12] mafk: looks like it. yea.. 
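As an aside, the Age-header regression filed above (T141373) is easy to observe from the outside — a sketch; the URL is illustrative and any cacheable page would do:

    # repeat a few times: Age should keep growing for a cached object;
    # a reset to 0 well before TTL expiry is the reported symptom
    curl -sI 'https://en.wikipedia.org/wiki/Main_Page' | grep -i '^age:'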
[16:24:20] ok [16:25:57] we need to get the mw-config change merged first [16:26:27] not sure if swat-able [16:26:46] urandom: no, unlikely it is a problem [16:30:31] (03PS1) 10Alex Monk: Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) [16:30:38] 06Operations, 13Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#2496057 (10AlexMonk-WMF) a:03AlexMonk-WMF [16:32:56] (03PS2) 10ArielGlenn: remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 [16:33:52] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:33:53] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:34:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:34:17] (03CR) 10ArielGlenn: [C: 032] remove 'enable' and 'ensure' class params from snapshot manifests [puppet] - 10https://gerrit.wikimedia.org/r/301143 (owner: 10ArielGlenn) [16:36:02] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:36:03] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [16:36:09] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2496062 (10GWicke) After deploying the changes mentioned in T136957#2485532 yesterday, it looks like today's network issue did not result in any RB m... [16:36:12] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:41:11] (03CR) 10Dzahn: [C: 031] "makes sense to use upstream tool. seems to work for me on terbium." [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [16:41:57] (03PS1) 10ArielGlenn: clean up dumps dirs manifest for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301150 [16:42:09] mutante <3 also https://gerrit.wikimedia.org/r/#/c/301059/1? [16:42:31] mutante I also wrote some docs https://wikitech.wikimedia.org/wiki/LDAP#Common_LDAP_administrative_actions [16:44:42] YuviPanda: the one for groups .. i tried that just now [16:44:45] ldapvi -b ou=groups uid="$GROUP" [16:44:53] cn=$GROUP [16:44:55] not uid= [16:45:04] yea, but that is pasted from your change [16:45:07] i was about to ask [16:45:07] I should fix [16:45:11] right [16:46:21] YuviPanda: very useful to check the group and +100 on the comments about removing old access [16:46:36] i'll get some food and be back [16:46:57] and yep, makes sense to use the upstream code [16:47:11] cool [16:48:05] the only drawback if there is one at all.. 
it seems a bit easier to make a bad mistake [16:48:14] (03PS2) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) [16:48:15] mutante updated [16:48:16] (03PS3) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) [16:48:18] (03PS7) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) [16:48:20] (03PS2) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [16:48:22] (03PS2) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [16:49:40] mutante yeah, but you can review your changes with ldapvi before committing them [16:50:05] (03CR) 10Elukey: "Ran another time pcc and same result: https://puppet-compiler.wmflabs.org/3483/" [puppet] - 10https://gerrit.wikimedia.org/r/301083 (https://phabricator.wikimedia.org/T140869) (owner: 10Elukey) [16:50:18] YuviPanda: one more comment. in the new PS now one script is in /usr/local/bin and the other in /usr/local/sbin [16:50:32] right, I'll move them to /usr/local/bin [16:50:36] YuviPanda: good point, yep [16:50:45] (03CR) 10ArielGlenn: [C: 032] clean up dumps dirs manifest for snapshots [puppet] - 10https://gerrit.wikimedia.org/r/301150 (owner: 10ArielGlenn) [16:51:46] mutante updated [16:51:58] (03PS3) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [16:52:00] (03PS3) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [16:53:02] (03CR) 10Dzahn: [C: 031] ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [16:53:08] hey, this should not be happening -> https://commons.wikimedia.org/w/index.php?title=User_talk:DerHexer&action=rollback&from=&token=5345d636ef4ede27d1c74a063f57618757979546%2B%5C [16:53:09] lgtm [16:53:15] (03PS1) 10ArielGlenn: move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1700). Please do the needful. [17:00:34] will deploy parsoid in a little bit. 
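As corrected earlier in this exchange, group entries are keyed by cn, not uid. A minimal lookup sketch — the ou=groups base is from the log, while the full base DN and server come from the local ldapvi setup and are assumptions here:

    GROUP=project-example    # hypothetical group name
    ldapvi -b ou=groups "cn=$GROUP"
    # ldapvi opens the matching entries in your editor and prompts before
    # writing anything back, which is the review-before-commit step above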
[17:03:32] !log starting parsoid deploy [17:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:51] (03PS2) 10ArielGlenn: move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 [17:05:44] things I love about the new gerrit: it's smart enough to figure out when you've just done a rebase or a commit message update via git on command line [17:05:48] used to not be so [17:06:11] !log synced new parsoid code; restarted parsoid on wtp1007 as a canary [17:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:20] (03CR) 10ArielGlenn: [C: 032] move cronjobs class from role to snapshot module and add user param [puppet] - 10https://gerrit.wikimedia.org/r/301161 (owner: 10ArielGlenn) [17:08:44] apergos yeah i think there is a bug that if you push an updated change it shows the previous change name. [17:09:00] (03CR) 10BryanDavis: "Looks to be causing:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301071 (https://phabricator.wikimedia.org/T141242) (owner: 10Giuseppe Lavagetto) [17:10:40] <_joe_> bd808: heh, you're right [17:10:45] !log starting branch cut for 1.28.0-wmf.12 [17:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:03] <_joe_> bd808: can that wait until tomorrow morning my time? [17:11:04] (03PS1) 10ArielGlenn: get rid of the useless snapshot cron role wrapper [puppet] - 10https://gerrit.wikimedia.org/r/301167 [17:11:16] <_joe_> on a second puppet run it should fix itself, right? [17:11:30] nope. it's permanently hosed [17:11:37] <_joe_> oh, dear [17:11:42] <_joe_> yeah let me fix it [17:11:43] (03CR) 10Chad: [C: 031] Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:11:45] because the initial conversion breaks part way through [17:11:48] I've got a patch [17:11:58] <_joe_> oh, were is it? [17:12:03] <_joe_> *where [17:12:16] In my edit buffer :) [17:12:20] (03CR) 10Dzahn: [C: 031] "yea, i think it could be from long time ago like you say...as long as you are sure it's not used in Labs" [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:12:23] <_joe_> AHAH OK [17:12:30] <_joe_> I'll be back in a few then [17:12:31] !log finished deploying parsoid version 285b6983 [17:12:34] time to verify .. [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:36] <_joe_> (was calling it a day) [17:13:17] (03PS1) 10BryanDavis: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 [17:13:29] (03CR) 10Dzahn: [C: 032] Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:13:36] (03PS8) 10Dzahn: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [17:13:43] mutante ^^ thanks :) :) [17:14:54] mutante https://gerrit.wikimedia.org/r/#/c/301001/8 will need to be c+2 again please.
[17:15:02] Since you rebased it after doing c+2 [17:15:26] (03CR) 10ArielGlenn: [C: 032] get rid of the useless snapshot cron role wrapper [puppet] - 10https://gerrit.wikimedia.org/r/301167 (owner: 10ArielGlenn) [17:15:43] i know, just have to wait a moment [17:15:50] YuviPanda: want to help unbreak all new self-hosted puppetmasters in Labs? https://gerrit.wikimedia.org/r/#/c/301168/ [17:15:52] (03PS9) 10Paladox: Update gerrit css to use the new defined css in gerrit 2.12 [puppet] - 10https://gerrit.wikimedia.org/r/301001 (https://phabricator.wikimedia.org/T141286) [17:16:07] ok [17:16:08] sorry [17:16:24] (03CR) 10Chad: "So I was looking at mysql-connector-java but I can't seem to find it." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [17:16:35] paladox: what was PS9 ? [17:16:47] A rebase [17:16:51] it showed merge conflict [17:16:57] due to it being fast forward [17:17:02] paladox: look at what PS8 was [17:17:09] Yeh [17:17:12] but showed it again [17:17:18] hi bd808 [17:17:21] due to something being merged after you rebased [17:17:27] it's not making it faster to add more PS [17:17:33] while waiting for the bot [17:17:36] (03PS2) 10Yuvipanda: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:17:37] Sorry [17:17:53] np, it will be live shortly [17:18:00] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:18:03] Ok [17:18:25] (03PS1) 10ArielGlenn: include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 [17:18:27] mutante I'm going to wait for ostriches to chime in about modify-ldap-groups and stuff since he also does these things and then merge [17:18:46] paladox: now it's merged. thing is that action grrrit-wm did not talk about [17:18:52] fucking gerrit with a fucking different submit fucking button aaarfgghh [17:18:57] YuviPanda: that sounds good , yes [17:19:00] Oh [17:19:06] (03CR) 10Chad: [C: 031] "Fine by me, didn't use it anyway." [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:19:12] (03CR) 10Chad: [C: 031] "Fine by me, didn't use it anyway." [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:19:14] it's like the anti-honeymoon-period all over again [17:19:16] * bd808 hugs YuviPanda and tells _joe_ to head off to bed/life [17:19:31] (03PS3) 10Yuvipanda: Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:19:32] Lol it showed the change twice ^^ [17:19:33] (03CR) 10Yuvipanda: [V: 032] Fix dependency ordering for self-hosted puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:19:49] bd808 yw [17:19:55] bd808 I've merged it now [17:20:12] (03PS8) 10Yuvipanda: ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) [17:20:15] I'll nuke my busted instance and try again :) [17:20:16] I mean I'll have to figure out how to use this ldapvi thing, but I'll live. 
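Chad's "can't seem to find it" above is the usual source-versus-binary package name trap (Muehlenhoff spells out the answer just below), and apt can untangle it directly — a sketch on a Debian host, assuming deb-src entries are configured:

    # mysql-connector-java is a source package; list what it builds
    apt-cache showsrc mysql-connector-java | grep '^Binary:'
    # the installable binary package is libmysql-java (simulate only)
    apt-get install -s libmysql-java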
[17:20:22] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Replace change-ldap-password with reset-ldap-password [puppet] - 10https://gerrit.wikimedia.org/r/301048 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:20:25] (don't let my fear of new things stop you :p) [17:20:48] (03CR) 10Chad: [C: 031] "sure why not" [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:20:59] (03CR) 10Chad: [C: 031] "move fast and break things?" [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:21:24] ostriches I wrote up https://wikitech.wikimedia.org/wiki/LDAP#Common_LDAP_administrative_actions [17:21:28] k [17:21:35] I'm trying to move fast, gerrit won't let me [17:21:48] (03PS4) 10Yuvipanda: ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) [17:21:56] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Drastically simplify modify-ldap-user [puppet] - 10https://gerrit.wikimedia.org/r/301052 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:06] (03PS3) 10Yuvipanda: ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) [17:22:11] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Remove unused homedirectorymanager [puppet] - 10https://gerrit.wikimedia.org/r/301053 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:18] (03PS4) 10Yuvipanda: ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) [17:22:24] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Vastly simplify modify-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/301059 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:22:41] (03PS4) 10Yuvipanda: ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) [17:22:46] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Add warning to ldaplist [puppet] - 10https://gerrit.wikimedia.org/r/301061 (https://phabricator.wikimedia.org/T114063) (owner: 10Yuvipanda) [17:23:15] done! [17:23:23] thank you, mutante / ostriches / krenair [17:23:53] (03PS2) 10ArielGlenn: include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 [17:26:25] (03CR) 10ArielGlenn: [C: 032] include the dumps packages in the dumps manifest in snapshot module [puppet] - 10https://gerrit.wikimedia.org/r/301169 (owner: 10ArielGlenn) [17:29:50] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:30:28] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:12] ostriches i now know what .commentPanelMessage did, it's css class name has changed in gerrit 2.12. [17:36:17] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:37:41] (03CR) 10Yuvipanda: [C: 04-1] "I *hate* ldaplist (and everything that uses the stupid ldapsupportlib, in fact). I've https://phabricator.wikimedia.org/T114063 open to ki" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:38:00] (03CR) 10BryanDavis: "Aargh. This actually breaks things worse. 
Now we have:" [puppet] - 10https://gerrit.wikimedia.org/r/301168 (owner: 10BryanDavis) [17:39:54] (03PS1) 10Paladox: Fix gerrit's css class .commentPanelMessage [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:40:29] (03PS2) 10Paladox: Gerrit: fix gerrit's css class .commentPanelMessage [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:40:37] ostriches ^^ [17:41:00] (03PS1) 10Eevans: Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) [17:42:42] (03PS3) 10Paladox: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) [17:43:17] (03CR) 10Muehlenhoff: "The name of the binary package is libmysql-java, mysql-connector-java is the source package name" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [17:46:11] (03CR) 10Eevans: [C: 04-1] "Disregard, just queuing this up for now; I will signal readiness with a +1 when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [17:46:19] (03PS1) 10BryanDavis: role::puppet::self: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/301175 [17:47:41] 06Operations, 06Services, 10Wikimedia-Logstash: New Kibana dashboards timing out consistently - https://phabricator.wikimedia.org/T141384#2496381 (10GWicke) [17:48:07] 06Operations, 06Services, 10Wikimedia-Logstash: New Kibana dashboards timing out consistently - https://phabricator.wikimedia.org/T141384#2496393 (10GWicke) p:05Triage>03High [17:48:41] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496381 (10GWicke) [17:49:03] (03PS1) 10Eevans: Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) [17:49:42] (03PS1) 10Yuvipanda: Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 [17:49:45] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2496402 (10Cmjohnson) 05Open>03Resolved I received the 7 switches, the box used was too big and 4 switches must have rolled around several times inside. The box when it arrived is badly... [17:50:41] (03CR) 10Eevans: [C: 04-1] "Disregard, just queuing this up for now; I will signal readiness with a +1 when ready." [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [17:52:00] (03PS1) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 [17:52:06] 06Operations, 10ops-eqiad: Survey available/unused ports on eqiad pfw's - https://phabricator.wikimedia.org/T141363#2496417 (10faidon) OK, did a little more investigation. pfw-eqiad is a cluster of two SRX650s, each with 4x1Gbps built-in ports, 16x1Gbps in a GPIM and 2x10G in an XPIM. The SRX platform in a c... [17:52:09] andrewbogott ^ and the DNS change for misc-varnish! [17:52:26] (03CR) 10Andrew Bogott: "I think ldaplist is a useful tool. 
Just because it uses a library with bugs in it doens't mean that everything is poisoned start to finis" [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:53:03] (03CR) 10Yuvipanda: "What are the concrete cases for ldaplist? If those are identified and listed I'll happily rewrite it to not suck." [puppet] - 10https://gerrit.wikimedia.org/r/295475 (owner: 10Alexandros Kosiaris) [17:53:25] (03CR) 10jenkins-bot: [V: 04-1] cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 (owner: 10Yuvipanda) [17:53:37] YuviPanda: Thanks! But the thing I was saying before is still true… that means it'll hit the service on http rather than https, and it doesn't work on http [17:53:41] so I still have that problem [17:54:01] andrewbogott oooh, it doesn't respect X-Forwarded-Proto [17:54:02] ? [17:54:17] right, then that's much more complicated then. [17:54:27] (03PS2) 10Yuvipanda: cache: Add labtestspice.wikimedia.org behind misc varnish [puppet] - 10https://gerrit.wikimedia.org/r/301178 [17:55:43] YuviPanda: I don't know… just that there's no config option to tell it whether to talk https or http. and if I point to it with an http url it says 'Error: Unexpected protocol mismatch. [17:56:11] right. so if it respects X-Forwarded-Proto that sohuld work with misc-varnish [17:56:12] but we can test that I think [17:56:18] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496429 (10bd808) I think this has something to do with the search term highlighting that kibana4 does server side. I'll poke at it a bit and s... [18:00:53] 06Operations, 10ops-eqiad, 10media-storage, 13Patch-For-Review: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2496454 (10Cmjohnson) [18:00:55] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2496452 (10Cmjohnson) 05Open>03Resolved I received the disk and replaced it root@ms-be1022:~# hpssacli ctrl slot=3 ld all show status logicaldrive 1 (186.3 GB, 0): OK logica... [18:01:11] 06Operations, 10ArchCom-RfC, 06Services, 07Archcom-has-shepherd, 07RfC: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#2496458 (10RobLa-WMF) [18:02:04] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496461 (10GWicke) Thanks, @bd808! [18:03:05] (03PS1) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:03:37] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.113:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.113, port=9200): Read timed out. (read timeout=4) [18:05:21] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2496465 (10Cmjohnson) I requested 2 SSD"s to be sent and the confirmation email states 2 SSD's but they actually sent me 2 4TB HDD's instead. A call to them has to take place. 
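The protocol-mismatch debugging above can bypass varnish entirely — a sketch; the backend host and port are placeholders, since the spice service's actual listener isn't named in the log:

    # if the app honors X-Forwarded-Proto, claiming HTTPS termination here
    # should make the 'Unexpected protocol mismatch' error go away
    curl -si -H 'X-Forwarded-Proto: https' 'http://BACKEND_HOST:PORT/' | head -n 1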
[18:07:04] !log Restarted elasticsearch on logstash1003, couldn't find master [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:29] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 30, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards [18:08:06] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496490 (10bd808) From [[https://www.elastic.co/guide/en/kibana/current/advanced-options.html|upstream docs]]: > **doc_table:highlight** > High... [18:08:10] thcipriani: and also, I would still be around after the train if you would have the time! [18:09:20] addshore: kk, I'll ping you when I'm done Train-ing. [18:12:34] varnish is hating kibana4 right now :/ [18:13:51] bd808: in what way? [18:14:45] I think the kibana4 nodes are timing out on some queries and then varnish is marking the node as offline for a few minutes [18:15:16] and they we run out of nodes and get a "service is dead" type response from varnish [18:15:24] I don't think varnish is actually doing anything wrong here [18:15:29] (03PS2) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:15:32] just bad alignment of stars [18:15:56] also I continue to hate kibana4 :) [18:16:28] adding a node proxy into the mix does not make it more stable or performant as far as I can tell [18:19:08] bd808: Something something, onions and layers :p [18:19:37] bad alignment of tors [18:19:55] YuviPanda: I really broke self-hosted puppet with that patch I tricked you into merging. I think that https://gerrit.wikimedia.org/r/#/c/301175/1 will unbreak both the new and old breakage [18:20:43] (03CR) 10Yuvipanda: [C: 032] role::puppet::self: Break dependency cycle [puppet] - 10https://gerrit.wikimedia.org/r/301175 (owner: 10BryanDavis) [18:20:55] bd808 done [18:21:04] (03PS2) 10Addshore: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) [18:21:11] (03CR) 10Addshore: [C: 032] Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [18:21:36] (03Merged) 10jenkins-bot: Enable RevisionSlider on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301105 (https://phabricator.wikimedia.org/T138943) (owner: 10Addshore) [18:22:03] (03CR) 10ArielGlenn: [C: 032] provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 (owner: 10ArielGlenn) [18:22:13] (03PS3) 10ArielGlenn: provide and use variable names for all dump related directories [puppet] - 10https://gerrit.wikimedia.org/r/301180 [18:25:18] (03CR) 10Dzahn: [C: 031] Replace manually-maintained bastiononly group with the new 'all-users' [puppet] - 10https://gerrit.wikimedia.org/r/301149 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [18:25:45] YuviPanda: thanks [18:26:58] (03CR) 10Chad: "Ah ok, makes sense. 
I was looking at libbcprov-java too, but it looks like it might be too dated for our use :\" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (https://phabricator.wikimedia.org/T70271) (owner: 10Chad) [18:28:04] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: Enable RevisionSlider on mediawikiwiki {{gerrit|301105}} (duration: 01m 28s) [18:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:16] (03PS1) 10ArielGlenn: use fixed repodir setting for dumps jobs [puppet] - 10https://gerrit.wikimedia.org/r/301181 [18:31:10] (03PS4) 10Dzahn: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:32:22] (03CR) 10Dzahn: [C: 032] "very similar to the issue we merged earlier, but for the commit-message. noticed that issue with the scrollbar too" [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:33:28] (03CR) 10ArielGlenn: [C: 032] use fixed repodir setting for dumps jobs [puppet] - 10https://gerrit.wikimedia.org/r/301181 (owner: 10ArielGlenn) [18:33:52] (03CR) 10BryanDavis: "This breaks the dependency cycle, but on the one test host I built with it (striker-deploy01.striker.eqiad.wmflabs) the initial puppet run" [puppet] - 10https://gerrit.wikimedia.org/r/301175 (owner: 10BryanDavis) [18:34:18] (03PS5) 10Dzahn: Gerrit: fix gerrit's css class .commentPanelMessage name [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:36:23] !log addshore@tin Synchronized php-1.28.0-wmf.11/extensions/WikimediaEvents/WikimediaEventsHooks.php: dewiki_diffstats add rev timestamps & feature state {{gerrit|301119}} (duration: 00m 28s) [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:51] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2496549 (10Jgreen) 05Open>03Resolved [18:37:19] thcipriani: nothing exploded ;) [18:37:43] (03CR) 10Dzahn: [V: 032] "was tested on http://gerrit-test.wmflabs.org/gerrit/#/c/16/" [puppet] - 10https://gerrit.wikimedia.org/r/301172 (https://phabricator.wikimedia.org/T141286) (owner: 10Paladox) [18:37:45] addshore: \o/ kudos on the first successful "swat"-ish :) [18:39:25] mutante ^^ thanks [18:40:39] commit-msg should now be pre-wrapped again in gerrit :) [18:40:44] paladox: applied [18:40:46] ostriches ^^ [18:40:47] thanks [18:41:30] (03CR) 10Dzahn: "yes, should be limited to eventbus things. if kafka things are needed in the future that should probably be kafka-admins or so" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [18:42:18] (03CR) 10Dzahn: "i think the * in the sudo line means that it actually allows controlling any service. 
better to avoid that wildcard and actually list the " [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [18:47:03] (03PS1) 10ArielGlenn: add notes about manual additions needed for new snapshot nodes [puppet] - 10https://gerrit.wikimedia.org/r/301182 [18:49:24] (03CR) 10ArielGlenn: [C: 032] add notes about manual additions needed for new snapshot nodes [puppet] - 10https://gerrit.wikimedia.org/r/301182 (owner: 10ArielGlenn) [18:58:39] bd808: we can alter the timeouts for varnish->kibana, too [18:58:58] bd808: there's separate timeouts for first connect, first byte, idle between response bytes, etc [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T1900). [19:00:16] * thcipriani does [19:02:57] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 [19:07:26] !log thcipriani@tin Purged l10n cache for 1.28.0-wmf.10 [19:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:10] * Reedy wonders which further extensions can be moved to extenson.json in extension-list after todays deploy [19:09:19] !log thcipriani@tin Started scap: testwiki to php-1.28.0-wmf.12 and rebuild l10n cache [19:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:49] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496627 (10EBernhardson) This also looks like it could be related to the mapping that is generated for restbase: ``` ebernhardson@logstash1001... [19:12:31] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496642 (10bd808) Timeouts still seem to happen (and I still see highlighted terms on other dashboards that load). More strangeness: * [[https... [19:12:52] ebernhardson: jinx! I think we came to the same conclusion [19:13:15] the records out of restbase are a bit too structured [19:15:20] bd808: :) [19:16:42] i had noticed that before and wondered as well, but it didn't seem to be causing issues so didn't think about it till now [19:16:54] (03PS1) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:17:17] (03CR) 10jenkins-bot: [V: 04-1] Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [19:18:39] (03PS2) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:18:41] (03PS2) 10Yuvipanda: python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 [19:19:39] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496670 (10GWicke) > This also looks like it could be related to the mapping that is generated for restbase Has this mapping changed recently,... 
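The separate varnish timeouts mentioned above (first connect, first byte, idle between response bytes) have runtime-tunable global defaults — a sketch; the value is illustrative, and per-backend overrides would live in the VCL backend definition instead:

    varnishadm param.show connect_timeout
    varnishadm param.show first_byte_timeout
    # raise the idle-between-response-bytes default to 60s
    varnishadm param.set between_bytes_timeout 60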
[19:22:35] (03CR) 10Dzahn: [C: 04-1] Create the group eventbus-admins [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [19:23:02] (03CR) 10Dzahn: "yep, please remove the * and list the actual subcommands" [puppet] - 10https://gerrit.wikimedia.org/r/300860 (https://phabricator.wikimedia.org/T141013) (owner: 10Elukey) [19:31:27] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on port 9042 [19:32:03] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496712 (10GWicke) > My new theory about why kibana4 and restbase aren't getting along is the incredibly high cardinality of fields in the rest... [19:33:10] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1009.eqiad.wmnet) [19:33:11] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:33] (03PS3) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:33:40] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-a.eqiad.wmnet) [19:33:41] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:33:42] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:34:48] !log thcipriani@tin Finished scap: testwiki to php-1.28.0-wmf.12 and rebuild l10n cache (duration: 25m 29s) [19:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:41] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1009-b.eqiad.wmnet) [19:37:43] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:37:43] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:44] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 (owner: 10Thcipriani) [19:38:57] (03PS5) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:39:10] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301186 (owner: 10Thcipriani) [19:40:31] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.12 [19:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:42:27] (03PS6) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [19:42:50] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1014.eqiad.wmnet) [19:42:51] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:05] !log T134016, 
T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-a.eqiad.wmnet) [19:43:06] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:43:07] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:01] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496746 (10GWicke) The best bet for the source of those `err_*` keys I have so far is https://github.com/wikimedia/hyperswitch/blob/a884a7c1afe... [19:49:36] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1014-b.eqiad.wmnet) [19:49:38] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:49:38] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:45] !log T140825: Setting vm.dirty_background_bytes=24576 (restbase1015.eqiad.wmnet) [19:53:46] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:53:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:06] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-a.eqiad.wmnet) [19:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:18] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:54:18] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:57:15] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496801 (10bd808) >>! In T141384#2496712, @GWicke wrote: >> My new theory about why kibana4 and restbase aren't getting along is the incredibly... [19:57:28] (03CR) 10Brian Wolff: "Note, this will change the default thumbnail size fetched for galleries, which could cause a spike in requests to thumbnail servers. Given" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [19:58:40] !log T134016, T140825: Restarting Cassandra to disable trickle_fsync and streaming socket timeouts (restbase1015-b.eqiad.wmnet) [19:58:42] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:58:42] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [19:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:58:57] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
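One way to put a number on the field-cardinality theory in T141384 above is to count mapped fields in a day's index — a rough sketch from a logstash host; the index name and host are assumptions:

    # every mapped leaf field carries a "type" entry; crude but telling
    curl -s 'http://localhost:9200/logstash-2016.07.26/_mapping' | grep -o '"type"' | wc -l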
[19:59:53] (03CR) 10BryanDavis: [C: 032] python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 (owner: 10Yuvipanda) [20:01:28] (03Merged) 10jenkins-bot: python: Load python and python3 plugins [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301014 (owner: 10Yuvipanda) [20:01:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2496812 (10Gehel) It would probably be easier for @Jonas to have direct access to the nginx logs on the wdqs servers. I'm not familiar to how we handle that (... [20:04:14] (03CR) 10Eevans: [C: 031] "I'm ready for this to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:04:40] mutante: I have a couple of Cassandra instances to do, can you help? [20:04:53] (03PS2) 10Eevans: Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) [20:05:26] (03CR) 10BryanDavis: [C: 04-1] Fix generic webservices (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [20:06:14] urandom: yes [20:06:51] mutante: cool, https://gerrit.wikimedia.org/r/301176 should be first, and then once i have that running, https://gerrit.wikimedia.org/r/#/c/301174 [20:10:52] ok, sec [20:10:56] urandom: on it now [20:11:00] mutante: kk [20:11:34] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase1009-c.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301176 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:12:06] urandom: first one is active now [20:12:17] mutante: thanks [20:15:19] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496842 (10Pchelolo) Actually the reason of these `err` keys is that sometimes we log the full request/responce. If it's an object with randomi... [20:16:28] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/301192 [20:23:57] !log T134016: Bootstrapping restbase1009-c.eqiad.wmnet [20:23:59] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:02] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jonas Kress - https://phabricator.wikimedia.org/T140911#2496878 (10Dzahn) Would it be ok with everyone here if we confirm the grafana part works, close this ticket as resolved (since it was all about NDA) and open... [20:26:47] (03CR) 10Eevans: [C: 031] "Ready." [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:26:51] (03PS2) 10Eevans: Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) [20:27:13] mutante: ready for https://gerrit.wikimedia.org/r/#/c/301174/ whenever you are. [20:27:21] mutante: no rush btw. 
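Bootstraps like the restbase1009-c one just logged are usually babysat with nodetool — a sketch; the per-instance wrapper name (nodetool-c) matches the instance naming above but should be treated as an assumption:

    # UJ = Up/Joining; the line disappears once the join completes
    nodetool-c status | grep -w UJ
    # active streaming sessions and bytes transferred for the bootstrap
    nodetool-c netstats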
[20:27:57] (03CR) 10Dzahn: [C: 032] Enable Cassandra instance restbase2005-c.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/301174 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:28:22] urandom: no problem, done [20:28:24] (03PS7) 10Yuvipanda: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 [20:28:32] mutante: awesome, thanks! [20:28:41] urandom: i like that how you prepare them and then mark +1 when it's time [20:29:17] mutante: ok, cool; i'll be sure to keep doing that [20:29:23] (03CR) 10Yuvipanda: Fix generic webservices (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [20:29:36] bd808 ^ fixed Tue, but no need for .extend [20:33:57] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496912 (10EBernhardson) >>! In T141384#2496670, @GWicke wrote: >> This also looks like it could be related to the mapping that is generated fo... [20:36:14] thcipriani: o/ MW train done? mind if i do a quick mobileapps deployment? [20:36:44] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused [20:36:50] got it ^^ [20:37:09] !log Bootstrapping restbase2005-c.eqiad.wmnet [20:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:37:53] mdholloway: yup, train is done, all yours [20:37:54] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-29 20:37:34. [20:38:08] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: Connection refused daniel_zahn . [20:38:13] thcipriani: great, thx! [20:38:33] (03PS9) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [20:39:28] (03CR) 10ArielGlenn: "updated this patch after move of misc crons to snapshot1007 and refactor of cron job classes" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [20:40:26] YuviPanda: your reset-ldap-password change has a problem on terbium [20:41:12] mutante uh, what is the issue? [20:41:18] also can you file a bug? I"m about to go for lunch just now [20:41:47] !log starting mobileapps deployment [20:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:02] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2496951 (10EBernhardson) >>! In T141384#2496842, @Pchelolo wrote: > Actually the reason of these `err` keys is that sometimes we log the full r... [20:42:09] YuviPanda: sure, you got a bug when you get back :) [20:42:21] Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/ldap/reset-password [20:43:28] (03PS1) 10Yuvipanda: Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 [20:43:40] (03PS1) 10Yuvipanda: ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 [20:43:42] mutante ah, ^ should fix that. [20:43:44] can you merge? 
[20:43:47] (the puppet change) [20:43:49] looking [20:43:55] !log mobileapps deployed fd3f33b [20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:09] oh! sure [20:44:41] (03CR) 10Dzahn: [C: 032] ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 (owner: 10Yuvipanda) [20:45:38] oh, first time i see "submit including parents" [20:45:57] (03PS2) 10Dzahn: ldap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/301199 (owner: 10Yuvipanda) [20:45:57] sounds dangerous [20:46:01] yes [20:49:00] hmm... do we have to run any script to update the favicon of a project? [20:49:14] (03CR) 10BryanDavis: [C: 032] Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 (owner: 10Yuvipanda) [20:49:32] (03Merged) 10jenkins-bot: Add maven to jdk8 image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/301198 (owner: 10Yuvipanda) [20:50:35] hashar: thanks for working on CI integration for operations/debs, but please exclude "linux" and "linux44", these are really massive builds and it only makes sense to trigger a test build manually [20:51:24] it's even likely that they deplete disk space on the build VMs entirely... [20:52:45] mafk: it's created by favicon.php i think [20:53:24] I updated mk.wiktionary favicon on today's morning SWAT and while the patch is merged, favicon is still the old one mutante [20:54:02] sync-common-file? [20:58:57] (03PS1) 10Reedy: 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) [20:59:33] (03CR) 10Hashar: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/301192 (owner: 10Hashar) [21:00:06] mafk: i dont know. maybe the SWATter person odes [21:00:08] does [21:01:09] mutante: how would you transfer data between an arbitrary eqiad host and a codfw host? [21:01:39] netcat + openssl ? [21:01:51] hashar: what about resuming a broken transfer? [21:01:55] urandom: if it's small and not secret, i would use scp to localhost, scp -3 [21:02:05] mutante: it's ~900G [21:02:09] rsync!!!!!! [21:02:14] hashar: how? [21:02:19] hashar: over ssh? [21:02:20] urandom: i let puppet setup rsyncd [21:02:31] urandom: no, over rsync protocol to rsycnd [21:02:33] no need for rsyncd [21:02:38] if you get ssh [21:02:43] mutante: OK, yeah, someone else suggested that [21:02:45] you dont get ssh [21:03:01] urandom: you can copy it from an existing class.. hold on [21:03:04] mutante: do you know of a good example to crib from the repo? 
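Of the transfer options floated above, rsync is the one that answers the resume question: --partial keeps half-transferred files so a broken ~900G copy can pick up where it left off. A client-side sketch for both variants mutante mentions; hosts, paths and the module name are placeholders:

    # over ssh, where permitted
    rsync -av --partial --progress /srv/data/ dest.codfw.wmnet:/srv/data/
    # over the rsync protocol, against a puppet-managed rsyncd module (see below)
    rsync -av --partial --progress /srv/data/ rsync://dest.codfw.wmnet/MODULE/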
[21:03:32] (03CR) 10Reedy: [C: 032] 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [21:03:51] urandom: modules/role/manifests/lists/migration.pp for example [21:03:58] (03Merged) 10jenkins-bot: 4 more to extension.json in extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301283 (https://phabricator.wikimedia.org/T139800) (owner: 10Reedy) [21:04:44] urandom: so you put a role like that on the target server first, and adjust the source IP in there [21:05:10] mutante: nice; and ferm too i assume [21:05:19] !log reedy@tin Synchronized wmf-config/extension-list: moar extension.json (duration: 00m 33s) [21:05:21] urandom: that will add rsyncd and ferm rules and with rsync::server::module you add one or more modules [21:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:28] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:07:05] urandom: just that if you remove that class again, puppet will not clean up. gotta remember to manually kill rsyncd and the configs when done [21:07:19] :/ [21:07:36] yeah, this is temporary... [21:07:55] is it a thing where you also have to copy the data back in the other direction ? [21:07:56] and it needs to be added to only a small number of hosts (3) [21:08:09] mutante: some day [21:08:11] in that case i edit the IP and move the role to another node [21:08:57] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused [21:09:02] got it ^^ [21:09:14] (03PS1) 10Chad: WIP: Bring over php::ini from MediaWiki vagrant [puppet] - 10https://gerrit.wikimedia.org/r/301285 [21:09:38] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-07-27 21:09:22. [21:09:39] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: Connection refused daniel_zahn . [21:14:20] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2497030 (10GWicke) > Perhaps there is something similar that can be handled in node to filter the output? Yes, we can definitely sanitize this... [21:15:55] (03CR) 10BryanDavis: [C: 032] Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [21:17:18] (03Merged) 10jenkins-bot: Fix generic webservices [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/301188 (owner: 10Yuvipanda) [21:19:38] 06Operations, 10Cassandra, 10hardware-requests: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2331203 (10GWicke) Another option for large-scale Cassandra testing that we can pursue in parallel is using cloud infrastructure like GCE. A recent demo showed about [1000 Ca... [21:21:40] (03CR) 10Chad: "Passed puppet compiler!
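(The pattern mutante points at above reduces to: the receiving host gets rsyncd, one exported rsync "module" restricted to the sending host's IP, and a ferm rule opening port 873 to that IP. A rough sketch of such a throwaway role, modeled on modules/role/manifests/lists/migration.pp; the class name, path and source IP are invented, and the exact parameters of the repo's rsync defines may differ:)

```puppet
# Hypothetical one-off migration role; names, paths and the source IP
# are placeholders, not real production values.
class role::cassandra_migration {

    $source_ip = '10.64.0.1'  # the host the data will be pushed from

    include rsync::server      # installs and runs rsyncd

    # One named, exported path (a "module" in rsync terms)
    rsync::server::module { 'cassandra-data':
        path        => '/srv/cassandra',
        read_only   => 'no',
        hosts_allow => [$source_ip],
    }

    # Open the rsync port to the source host only
    ferm::service { 'cassandra-migration-rsync':
        proto  => 'tcp',
        port   => '873',
        srange => "${source_ip}/32",
    }
    # Note mutante's caveat above: removing this class later does not
    # clean up; rsyncd and its config must be stopped/removed by hand.
}
```

(From the source host the push would then look roughly like rsync -av /srv/cassandra/ rsync://target.codfw.wmnet/cassandra-data/, and re-running it resumes an interrupted transfer.)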
https://puppet-compiler.wmflabs.org/3491/lead.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/300930 (owner: 10Chad) [21:22:21] (03PS2) 10Chad: Gerrit: Remove all the junk to support 2.8 [puppet] - 10https://gerrit.wikimedia.org/r/300930 [21:26:06] (03PS1) 10Dzahn: ldap: fix typo for reset-password script name [puppet] - 10https://gerrit.wikimedia.org/r/301288 [21:27:11] (03CR) 10Dzahn: [C: 032] ldap: fix typo for reset-password script name [puppet] - 10https://gerrit.wikimedia.org/r/301288 (owner: 10Dzahn) [21:29:47] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:44:06] (03PS1) 10Andrew Bogott: Set up spice-based remote consoles for Labs instances [puppet] - 10https://gerrit.wikimedia.org/r/301294 (https://phabricator.wikimedia.org/T141399) [21:47:07] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497082 (10ellery) @BBlack, @Nuria In order to run a randomized controlled experiment, you need to ensure that users are randomly assigned to treatment conditions at the start... [21:50:23] ostriches, i spoke with qchris_; they seem to now be prioritising gerrit's ui, going with the new ui, which i presume is polygerrit [21:50:28] (03PS2) 10Andrew Bogott: Add domain labtestspice.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/301177 (https://phabricator.wikimedia.org/T141399) (owner: 10Yuvipanda) [21:50:35] meaning they will be less likely to fix the current ui [21:50:40] since there will be no point [21:50:43] mutante ^^ [21:51:12] RoanKattouw, seems the only way the diff problem you had will get fixed is when they fix it in the new ui [21:56:56] 06Operations, 06Services, 10Wikimedia-Logstash: Kibana / logstash dashboards timing out consistently since Kibana upgrade - https://phabricator.wikimedia.org/T141384#2497094 (10GWicke) A PR removing the full request body from storage backend errors is now available at https://github.com/wikimedia/restbase-mo... [22:10:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [22:10:57] (03PS1) 10Dzahn: restbase-test: setup rsync for data from cassandra-test [puppet] - 10https://gerrit.wikimedia.org/r/301303 [22:11:08] paladox: yep, ok [22:11:21] mutante not sure why you're saying yep ok. [22:12:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:12:36] paladox: well, you pinged me [22:12:46] Oh [22:12:55] and i saw what you were pointing to, but i dont have much to add right now [22:12:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [22:13:42] ok [22:14:07] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5065553 keys - replication_delay is 0 [22:27:37] (03CR) 10Jforrester: [C: 04-1] "Let's not be hasty."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/301129 (https://phabricator.wikimedia.org/T141349) (owner: 10Jforrester) [22:31:34] (03PS1) 10Ppchelko: Change-Prop: Increase maximum concurrency for ORES [puppet] - 10https://gerrit.wikimedia.org/r/301305 [22:32:49] 06Operations, 10Analytics: stat1004 doesn't show up in ganglia - https://phabricator.wikimedia.org/T141360#2497199 (10Peachey88) [22:37:34] 06Operations, 10MediaWiki-Releasing, 10Parsoid, 06Release-Engineering-Team: debian signing keyid E84AFDD2 has expired - https://phabricator.wikimedia.org/T141400#2497211 (10greg) [22:44:11] (03PS1) 10BBlack: Revert "ssl_ciphersuite: drop non-FS AES256 options" [puppet] - 10https://gerrit.wikimedia.org/r/301307 [22:44:32] (03CR) 10BBlack: [C: 032 V: 032] Revert "ssl_ciphersuite: drop non-FS AES256 options" [puppet] - 10https://gerrit.wikimedia.org/r/301307 (owner: 10BBlack) [22:44:49] (03CR) 10Ladsgroup: [C: 031] Change-Prop: Increase maximum concurrency for ORES [puppet] - 10https://gerrit.wikimedia.org/r/301305 (owner: 10Ppchelko) [22:48:42] 06Operations, 10RESTBase, 06Services, 13Patch-For-Review, 15User-mobrovac: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2497236 (10GWicke) >>! In T136957#2491863, @MoritzMuehlenhoff wrote: >> - stdout is apparently ignored & does not make it into the systemd journal. >... [22:54:30] 06Operations, 10MediaWiki-Releasing, 10Parsoid, 06Release-Engineering-Team: debian signing keyid E84AFDD2 has expired - https://phabricator.wikimedia.org/T141400#2497264 (10greg) We should probably update https://wikitech.wikimedia.org/wiki/Releases.wikimedia.org (or add a new page and link to it from [[Re... [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160726T2300). [23:00:04] James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:38] Heya. [23:00:41] Is it just me? [23:00:54] Got someone to deploy it? [23:00:58] Nope. [23:01:05] * Reedy looks [23:01:11] well, if I can re-schedule a couple of patches from morning swat reedy? [23:01:14] It's "just" a de-deployment. :-) [23:01:27] (03PS3) 10Reedy: De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:01:28] (window time went off due to elasticsearch failures) [23:01:54] mafk: I can do some if you need [23:02:09] (03CR) 10Reedy: [C: 032] De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:02:12] Reedy: thanks, let me find the gerrit links [23:02:28] Reedy: I guess we should sync Common, then Init*, then extension-list? [23:02:35] (03Merged) 10jenkins-bot: De-deploy ImageMetrics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301009 (https://phabricator.wikimedia.org/T140952) (owner: 10Jforrester) [23:02:48] extension-list is mostly a no-op [23:02:51] Sure. 
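(T141400 above is the routine failure mode for an apt repository whose signing key has passed its expiry date: clients start rejecting the repo's signatures. Checking, and extending the expiry if the secret key is still at hand, looks roughly like the sketch below; exact prompts vary by GnuPG version, and the repository's Release files would still need re-signing afterwards:)

```bash
# Check whether a signing key has expired (prints "[expired: ...]" if so)
gpg --list-keys E84AFDD2

# With the secret key available, extend the expiry interactively:
#   gpg> expire   (choose a new validity period)
#   gpg> save
gpg --edit-key E84AFDD2
```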
[23:02:53] but yeah, out of CS first [23:03:01] https://gerrit.wikimedia.org/r/#/c/300758/ [23:03:17] and https://gerrit.wikimedia.org/r/#/c/300880/ [23:04:05] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Undeploy ImageMetrics (duration: 00m 27s) [23:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:31] looks to be gone from Special:Version [23:04:54] Reedy: On 1099? [23:05:11] Everywhere! [23:05:48] Gah. [23:05:49] Tut. [23:05:50] OK. [23:06:15] Sync the rest then. :-) [23:06:35] (03PS1) 10GWicke: Service::node: Capture stdout and stderr in journal [puppet] - 10https://gerrit.wikimedia.org/r/301309 (https://phabricator.wikimedia.org/T136957) [23:06:59] !log reedy@tin Synchronized wmf-config/: Remove rest of ImageMetrics config (duration: 00m 33s) [23:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:43] (03PS5) 10Reedy: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:07:53] (03CR) 10Reedy: [C: 032] Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:08:18] (03Merged) 10jenkins-bot: Bump event-schemas submodule commit to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300880 (owner: 10MarcoAurelio) [23:08:39] error: insufficient permission for adding an object to repository database /srv/mediawiki-staging/.git/modules/wmf-config/event-schemas/objects [23:08:40] ffs [23:08:55] WHY are numerous root:root [23:09:18] Reedy: Probably done in a hurry by o.ri? [23:09:46] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2497291 (10BBlack) >>! In T135762#2497082, @ellery wrote: > As far as I can tell, the proposed method also violates the more important property that users need to be randomly a... [23:09:58] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:17] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:58] * Reedy goes to look for a root [23:11:53] perhaps scap needs something that blows up when files are owned by incorrect user? Although perhaps that's out of scope... [23:11:57] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:16] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Puppet has 2 failures [23:12:17] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:18] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 1 failures [23:12:26] oh no, not again... 
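(Reedy's "insufficient permission" failure above is what git prints when objects under /srv/mediawiki-staging/.git are owned by the wrong user, here root:root. The pre-flight check he suggests could look roughly like this; the expected owner "mwdeploy" is an assumption, substitute whatever user the deploy tooling actually writes as:)

```bash
# Hypothetical pre-flight ownership check for the staging clone.
# Fails loudly if anything under .git is not owned by the deploy user.
STAGING=/srv/mediawiki-staging
BADLY_OWNED=$(find "$STAGING/.git" ! -user mwdeploy -print | head -n 20)
if [ -n "$BADLY_OWNED" ]; then
    echo "Refusing to sync; wrongly-owned files under $STAGING:" >&2
    echo "$BADLY_OWNED" >&2
    exit 1
fi
```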
[23:13:37] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:06] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:07] PROBLEM - puppet last run on mw2066 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:55] looks like some kind of 500 server error from strontium [23:15:06] usual suspect then [23:15:11] so probably related to recent troubles I haven't really been following with puppetmasters falling over [23:18:08] !log reedy@tin Synchronized wmf-config/event-schemas: Bump event-schemas submodule commit to master (duration: 00m 28s) [23:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:25] mafk: ^ not sure if there's anything for you to obviously test for that [23:18:51] Reedy: well, I can do a git submodule update to test it updates to the latest commit [23:19:05] I mean, from the web [23:19:15] Submodule path 'wmf-config/event-schemas': checked out '4db9d40d28d61c53cdbca77059d9a2a6e714af89' [23:19:18] I don't think so [23:19:29] https://github.com/wikimedia/mediawiki-event-schemas/commit/4db9d40d28d61c53cdbca77059d9a2a6e714af89 [23:19:33] commit hashes match [23:19:34] WFM [23:19:47] yea i doubt anything about event-schemas is visible on the web, it would be seen either in eventbus or the monolog->kafka pipeline (depending on what changed) [23:20:01] (03PS8) 10Reedy: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:20:17] now people will have more updated files when cloning the mediawiki-config :) [23:22:57] (03CR) 10Reedy: [C: 032] Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:22:58] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Puppet has 1 failures [23:23:30] (03Merged) 10jenkins-bot: Disabling local uploads on ms.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/300758 (https://phabricator.wikimedia.org/T141227) (owner: 10MarcoAurelio) [23:24:07] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:27] RECOVERY - puppet last run on cp1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:25:32] !log reedy@tin Synchronized dblists/commonsuploads.dblist: Disabling local uploads on ms.wikipedia.org (duration: 00m 23s) [23:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:53] mw1099? 
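(The check mafk performs above is the standard way to verify a submodule bump: update the submodule pointer in a clone and confirm the checked-out hash matches the commit the merged change pins. As a sketch, with the clone path a placeholder:)

```bash
# Verify the event-schemas submodule bump from a mediawiki-config clone
# (clone location is a placeholder).
cd ~/mediawiki-config
git pull
git submodule update --init wmf-config/event-schemas
git -C wmf-config/event-schemas rev-parse HEAD
# Expected: 4db9d40d28d61c53cdbca77059d9a2a6e714af89, matching the
# github commit quoted above.
```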
[23:26:20] It's everywhere [23:26:54] ok, testing [23:27:33] WFM, mswiki's Special:ListGroupRights shows that upload rights have been removed from non-sysops, and upload link in the sidebar now points to commons UploadWizard [23:27:40] thank you Sir [23:30:08] Nemo_bis: that mswiki upload restriction is now done [23:34:27] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:34:36] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:36:46] RECOVERY - puppet last run on mw2066 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:36:47] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:36:56] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:36:56] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:38:16] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:39:24] Getting on every page mediawiki.org in the console for load.php [23:39:25] load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector:4 [V5ftWgpAAD8AAXroki4AAABG] 2016-07-26 23:08:10: Fatal exception of type "BadMethodCallException" [23:40:24] Still happening yp, for about 10 minutes now [23:40:25] load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector:4 [V5f0cgpAADoAAYz5SAcAAACH] 2016-07-26 23:38:26: Fatal exception of type "BadMethodCallException" [23:40:42] Reedy: ^ [23:41:26] * Reedy looks on fluorine [23:42:13] Reedy: https://logstash.wikimedia.org/app/kibana?#/doc/logstash-*/logstash-2016.07.26/mediawiki/?id=AVYpk6AVJW_HnhxrzSJX [23:42:29] Isn't that more likely .12 related than SWAT? [23:43:14] "Sessions are disabled for this entry point" [23:43:18] (03CR) 10Eevans: restbase-test: setup rsync for data from cassandra-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/301303 (owner: 10Dzahn) [23:43:19] thanks ebernhardson [23:43:33] > Kibana is loading. Give me a moment here. I'm loading a whole bunch of code. Don't worry, all this good stuff will be cached up for next time! [23:43:50] heh [23:44:04] Krinkle: I think you need to poke tgr or anomie as it looks session change related [23:44:12] Possibly only in .12... Depends if/what has changed recently [23:44:15] (Sorry these urls are so ugly...), but yes looks like strictly wmf.12 : https://logstash.wikimedia.org/app/kibana?#/dashboard/Fatal-Monitor?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-1h,mode:quick,to:now))&_a=(filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:level,negate:!t,value:NOTICE),query:(match:(level:(query:NOTICE)))),('$state':(store:appState),meta:(alias:!n,disa [23:44:18] it's MobileFrontend [23:44:22] err wow, that's worse than i thought [23:44:34] (the url, not the problem :P) [23:44:52] Is it worth rolling back to .11 for mw.org etc? [23:44:54] /group0 [23:45:45] I'm filing a bug at least [23:45:55] Reedy, Krinkle: matt_flaschen was looking at that one earlier. MobileFrontend bug. See https://gerrit.wikimedia.org/r/#/c/301310/. 
[23:46:27] oh, bug already exists [23:47:08] Yeah, T141386 [23:47:08] T141386: onResourceLoaderGetConfigVars can not depend on user-specific info for wikidata settings - https://phabricator.wikimedia.org/T141386 [23:47:56] * Reedy cherry picks to .12 [23:48:42] No immediate need to merge to master [23:48:55] ebernhardson: FWIW there is a link to the individual logstash record when you open the details, e.g. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2016.07.26/mediawiki?id=AVYpdym414thRtYyfGcl [23:49:25] it will still contain all the dashboard config crap but that can be cut off [23:49:36] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:51:12] tgr: yea i had linked that before, i was trying to link the fatalmonitor dashboard but with a custom filter so it showed all the wikis/deploy versions that had this specific error [23:51:22] Reedy, you're deploying it now? [23:51:35] I'm gonna when jenkins merges it [23:51:41] Okay, thanks. [23:51:49] tgr: turns out though there is a 'share' button which initially has the ugly url, but then clicking the button that kinda looks like -><- generates a short (er) url [23:59:04] !log reedy@tin Synchronized php-1.28.0-wmf.12/extensions/MobileFrontend/: Deploy revert for group0 for T141386 (duration: 00m 30s) [23:59:05] T141386: onResourceLoaderGetConfigVars can not depend on user-specific info for wikidata settings - https://phabricator.wikimedia.org/T141386 [23:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:18] Krinkle: ^^
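(The closing sequence above, cherry-pick the MobileFrontend revert onto the deployed wmf.12 branch, let jenkins merge it, pull it onto the deployment host, then sync only that extension directory, is the usual shape of an emergency group0 fix. Roughly, with the branch-update step simplified and the sync invocation hedged, since it varies with the scap version of the era:)

```bash
# Rough shape of the revert deployment logged above; illustrative only,
# not copied from the actual session. On the deployment host, after the
# cherry-pick has merged in gerrit:
cd /srv/mediawiki-staging/php-1.28.0-wmf.12/extensions/MobileFrontend
git fetch && git checkout origin/wmf/1.28.0-wmf.12

# Then push just that directory to the cluster (sync-dir, or
# "scap sync-dir", depending on the scap version in use):
cd /srv/mediawiki-staging
sync-dir php-1.28.0-wmf.12/extensions/MobileFrontend/ \
    'Deploy revert for group0 for T141386'
```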