[00:03:25] yes [00:03:42] I got the first reset [00:03:45] but not the second [00:03:49] as I was only a moderator [00:03:57] and the old pass obviously doesn't work [00:05:32] right [00:06:12] I messaged another person who I think was a co-listadmin at the time. They should (hopefully) have the current password [00:06:34] hmm [00:06:46] the mailman page only lists that email address [00:07:05] Yeah, they removed themselves from all the lists they were admin on. [00:07:09] so he probably didn't receive it either [00:07:30] maybe depending on the date he self-removed :P [00:07:45] yeah [00:13:36] (03CR) 10Madhuvishy: "One question, otherwise +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/305804 (owner: 10Yuvipanda) [00:16:11] (03PS1) 1020after4: Contint: add pyflakes/flake8 for pep8 linting [puppet] - 10https://gerrit.wikimedia.org/r/306851 [00:16:27] bd808: yuvipanda congrats, now we just need to move jouncebot and stashbot into diffusion :P [00:16:50] (the two on my list) [00:17:20] (03PS2) 1020after4: Contint: add pyflakes/flake8 for pep8 linting [puppet] - 10https://gerrit.wikimedia.org/r/306851 [00:41:44] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: puppet fail [00:52:44] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2584764 (10mmodell) The custom ssh setup happened in {T100519} and we need to duplicate that stuff in `codfw` [01:06:43] (03PS1) 10Ppchelko: Change-Prop: Rerender summary on wikidata item update [puppet] - 10https://gerrit.wikimedia.org/r/306857 [01:09:33] (03CR) 10Ppchelko: "Puppet compiler: https://puppet-compiler.wmflabs.org/3847/" [puppet] - 10https://gerrit.wikimedia.org/r/306857 (owner: 10Ppchelko) [01:10:23] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:21:00] (03PS14) 10Yuvipanda: Introduce 'clush' module and toollabs role [puppet] - 
10https://gerrit.wikimedia.org/r/305804 [02:26:11] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.16) (duration: 10m 55s) [02:32:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Aug 26 02:32:04 UTC 2016 (duration 5m 53s) [02:37:45] (03PS3) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [02:37:47] (03PS1) 10Faidon Liambotis: nagios: make check_sslxNN multithreaded [puppet] - 10https://gerrit.wikimedia.org/r/306866 [02:37:49] (03PS1) 10Faidon Liambotis: nagios: fix check_ssl with a newer IO::Socket::SSL [puppet] - 10https://gerrit.wikimedia.org/r/306867 [02:40:58] (03CR) 10Faidon Liambotis: "Reviving this a year and a half later! The multithreaded changes in Ib4b72f0cb431d59fde20324468078ac42c054335 should make this run in an a" [puppet] - 10https://gerrit.wikimedia.org/r/182306 (owner: 10Faidon Liambotis) [02:51:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 20 probes of 246 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:56:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 30 probes of 240 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:01:34] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:02] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [03:02:32] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:02:33] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 240 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:03:33] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [03:03:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 246 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:04:24] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [03:08:02] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 9 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [03:08:22] PROBLEM - cassandra service on aqs1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [03:09:04] PROBLEM - cassandra CQL 10.64.48.117:9042 on aqs1003 is CRITICAL: Connection refused [03:09:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [03:11:43] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [03:12:22] RECOVERY - cassandra service on aqs1003 is OK: OK - cassandra is active [03:17:43] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:17:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [03:23:04] RECOVERY - cassandra CQL 10.64.48.117:9042 on aqs1003 is OK: TCP OK - 0.002 second response time on port 9042 [03:34:03] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 
37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [03:34:13] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [03:36:03] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [03:36:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:15:29] 06Operations, 10Librarization, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#2584958 (10Krinkle) [04:38:43] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [04:38:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [04:42:43] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:42:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [04:52:43] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [04:52:53] PROBLEM - Router interfaces on cr2-eqiad is 
CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [04:56:43] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [04:56:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:35:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [05:37:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [05:47:32] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [05:47:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [05:50:11] (03CR) 10Madhuvishy: [C: 031] Introduce 'clush' module and toollabs role [puppet] - 10https://gerrit.wikimedia.org/r/305804 (owner: 10Yuvipanda) [05:53:24] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:53:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:59:42] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR 
[06:03:43] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:18:03] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:19:43] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [06:19:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [06:38:02] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:38:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [06:40:23] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 180363268 for key PRIMARY on query. Default database: heartbeat. Query: [snipped] [06:40:53] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 171970597 for key PRIMARY on query. Default database: heartbeat. Query: [snipped] [06:41:02] PROBLEM - MariaDB Slave SQL: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Error Duplicate entry 180359173 for key PRIMARY on query. Default database: heartbeat. 
Query: [snipped] [06:43:28] (03PS4) 10Muehlenhoff: role::logging::mediawiki: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/303140 [06:44:22] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:12] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures [06:54:03] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 23 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [07:02:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [07:04:23] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [07:16:32] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:16:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [07:19:33] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [07:20:43] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:21:25] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:21:25] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:33:03] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: 
cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [07:33:22] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [07:39:44] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2585146 (10jcrespo) [07:41:37] !log deploying schema change on s3 hosts T139090 [07:41:39] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [07:47:27] (03PS3) 10Giuseppe Lavagetto: puppetmaster: add puppetmaster::web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306702 (https://phabricator.wikimedia.org/T143869) [07:52:34] PROBLEM - DPKG on ms-be1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:52:40] 06Operations, 10netops: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2585165 (10elukey) Vacation times delayed this task, sorry :) I am almost positive that these rules can be removed, but I'd like a final confirmation from @JAlleman... [07:54:01] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add puppetmaster::web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306702 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [07:54:08] (03PS4) 10Giuseppe Lavagetto: puppetmaster: add puppetmaster::web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306702 (https://phabricator.wikimedia.org/T143869) [07:54:22] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: add puppetmaster::web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306702 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [07:55:22] jynus: good morning! 
I noticed a bunch of "cannot connect to db1080" on enwiki yesterday [07:55:31] just before I did the train of wmf.16 to group2 (wikipedias) [07:55:38] ah! [07:55:52] apparently resolved. Looks similar to an earlier spike at 3pm UTC which IIRC was related to CirrusSearch [07:56:01] I asked ostriches; he said he saw nothing on the software side [07:56:03] so I haven't rolled back the train, assuming it was the same issue going on [07:56:24] one can clearly see the spikes on logstash > mysql > 1 day [07:56:37] https://logstash.wikimedia.org/goto/812a50ed6208078fec34f15e05fe5b27 [07:56:53] maybe it was not just db1080 [07:56:58] but definitely enwiki [07:57:14] it was all main traffic [07:57:20] roughly half on 10.64.32.115 and the rest on 10.64.16.101 and 10.64.0.92 [07:57:28] saturating connections [07:57:49] traffic from jobs or just regular traffic to the site ? :( [07:58:32] <_joe_> hey, just a heads up [07:58:38] <_joe_> there is something wrong with palladium [07:58:50] <_joe_> I'm unsure if it's going to cause a shower of errors [07:59:17] (oh and also https://grafana.wikimedia.org/dashboard/db/mysql is very great) [07:59:32] I hate that someone merged a change about a query I absolutely disagree with [07:59:44] <_joe_> jynus: revert it :P [08:00:00] disagree == veto ? [08:00:17] if you need another voice feel free to poke me any time for an additional CR-2 :] [08:00:19] no, if I revert it, I could make things worse [08:00:34] the query is new, and should be disabled [08:00:41] and it should not query the main servers [08:01:06] (03PS1) 10Giuseppe Lavagetto: puppetmaster: fix typo in web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306877 [08:01:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster: fix typo in web_frontend [puppet] - 10https://gerrit.wikimedia.org/r/306877 (owner: 10Giuseppe Lavagetto) [08:02:23] I think I will create a UBN [08:02:30] _joe_ I was about to use confctl to depool mw servers, should I stop? Or were you referring only to puppet? 
<_joe_> elukey: puppet [08:03:02] <_joe_> elukey: or better, apache [08:03:07] <_joe_> but it was a PEBKAC [08:03:40] <_joe_> oh and ofc, today I'm going to merge a few changes on palladium, it will cause showers of puppet fails [08:04:25] ahhh okok thanks [08:04:26] :) [08:04:31] hashar, how can I see specifically elastic deployment? [08:07:34] jynus: via the SAL if people logged it, scap surely does [08:07:36] (03PS3) 10Giuseppe Lavagetto: puppetmaster: extract the passenger config from the virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/306703 [08:07:53] for the extension, you can look at the Elastica and CirrusSearch extension wmf branches to see whether any commit got merged [08:07:58] they are usually deployed right away [08:08:17] for the ElasticSearch daemon/backend, SAL would be the only way. Maybe dcausse knows? [08:08:47] so whatever is deployed to the "CirrusSearch extension" gets deployed immediately? [08:09:00] no train or anything? [08:09:02] usually yeah [08:09:11] what we do is that we get the change reviewed on master and merged there [08:09:11] then cherry-pick to the wmf-branch [08:09:17] (03PS2) 10Ladsgroup: ores: Define extra config for ores [puppet] - 10https://gerrit.wikimedia.org/r/306839 (https://phabricator.wikimedia.org/T143567) [08:09:27] and usually CR+2 the wmf change then scap it [08:09:36] though sometimes people merge before the swat deployment window [08:09:43] well, I highly disagree with https://gerrit.wikimedia.org/r/#/c/306803/1 [08:09:53] scap definitely emits !log notifications. 
And scap also logs everything to logstash [08:09:54] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:10:12] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [08:10:34] and with https://gerrit.wikimedia.org/r/#/c/306738/ [08:11:00] and either code was deployed in a rush [08:11:01] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Enable access to Wikipedia Tulu (tcywiki) on labs replicas - https://phabricator.wikimedia.org/T142223#2585214 (10AlexMonk-WMF) [08:11:06] <_joe_> jynus: if it's creating issues, revert it. If not, write to erikb [08:11:06] or it is now causing issues [08:11:09] 10Blocked-on-Operations, 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: maintain-replicas.pl unmaintained, unmaintainable - https://phabricator.wikimedia.org/T138450#2585217 (10AlexMonk-WMF) [08:11:31] 06Operations, 10DBA, 06Labs, 10Tool-Labs: Replicate wikimania2017wiki to labs - https://phabricator.wikimedia.org/T126096#2585224 (10AlexMonk-WMF) [08:11:44] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2585232 (10flemmerich) I signed the document (step 2) and requested developer access (step 3). 
[08:11:44] 06Operations, 10DBA, 06Labs, 07Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2585212 (10AlexMonk-WMF) [08:12:11] well, he asked me at https://phabricator.wikimedia.org/T143932#2583686 and deployed 8 minutes later [08:12:35] jynus: so yeah that change https://gerrit.wikimedia.org/r/#/c/306803/1 must have been deployed [08:12:45] in the middle of a schema change [08:12:51] and indeed on https://tools.wmflabs.org/sal/production : 23:15 Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch: (no message) (duration: 00m 57s) [08:13:17] so it looks like an emergency patch to me [08:13:27] but the query should not be there in the first place [08:13:30] that has been done during the Evening SWAT of yesterday https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0August.C2.A025 [08:13:50] Erik B (ebernhardson) [wmf.16] 306803 - Choose a better query plan for link counting [08:14:17] I guess they have identified that the query required a hint as to which index to use? [08:14:41] might have been in response to the few MySQL spikes we had yesterday [08:14:45] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: extract the passenger config from the virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/306703 (owner: 10Giuseppe Lavagetto) [08:15:23] yes, and now something worse could happen [08:15:31] 06Operations, 10DBA, 06Labs, 10Tool-Labs: Replicate wikimania2017wiki to labs - https://phabricator.wikimedia.org/T126096#2004612 (10AlexMonk-WMF) Needs a maintain-replicas run [08:16:07] I would assume good faith. 
I guess they either did not know about the index reordering / tweak / schema change [08:16:22] I do not assume bad faith [08:16:23] or took the decision that the index hint was needed right now without waiting for the schema update [08:18:26] !log upgrading httpd to 2.4.10-10+deb8u6+wmf2 on mw127[2345] [08:19:54] the patch should probably not use regular slaves in the first place, which is what caused issues; probably untested on enwiki; and it was fixed with a bad solution; it should have been reverted instead [08:20:30] jynus: should I do something? basically we moved to mysql instead of elastic to count incoming links [08:20:52] the first query plan was really bad and erik added a hint which seemed to speed up the query [08:21:07] and wanted to talk to you about that [08:21:19] that is a bad idea, revert first rather than poorly fix [08:21:57] no issues would have happened if a 'vslow' slave was chosen [08:22:20] jynus: Once you're free. I need several minutes of your time :) [08:22:24] did anyone review that patch? [08:22:55] the one with the use index hint? [08:23:02] the new query [08:23:46] I understand the need for a quick fix, even if it is a bad idea [08:23:56] but how was that query added in the first place? [08:24:42] jynus: as I said, to use mysql instead of elastic for incoming links counting [08:24:53] we thought it was ok [08:25:13] I am asking for the gerrit patch, please [08:25:16] ok [08:25:37] jynus: I have copy-pasted the IRC log from the deployment https://phabricator.wikimedia.org/T143932#2585266 [08:26:03] looks like the fix is a workaround for a perceived oddity in mysql. 
It apparently fixed it up and the task got filed for a proper fix to be found [08:26:09] jynus: https://gerrit.wikimedia.org/r/#/c/305325/ [08:26:50] (03PS2) 10Giuseppe Lavagetto: puppetmaster: add ca and ca_server settings to frontend [puppet] - 10https://gerrit.wikimedia.org/r/306827 (https://phabricator.wikimedia.org/T143869) [08:28:14] if you have a patch to deploy and want my assistance please let me know. Will be happy to help [08:28:18] (03PS1) 10Ladsgroup: beta: use enwiki model for ores in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306881 (https://phabricator.wikimedia.org/T143567) [08:28:39] ok, dcausse my request is, can we revert 305325 and then review the patch more thoroughly / wait for the schema change to complete? [08:28:56] I already have some suggestions [08:29:24] jynus: ok [08:30:55] _joe_: I'm around, please give me a ping when you're ready to do the puppet compiler deploy :-) [08:31:19] I think if a single extension has caused 2 major issues in 24 hours maybe we should slow down its deployment or change its deployment strategy? [08:31:32] (03CR) 10Hashar: [C: 032] beta: use enwiki model for ores in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306881 (https://phabricator.wikimedia.org/T143567) (owner: 10Ladsgroup) [08:31:58] (03Merged) 10jenkins-bot: beta: use enwiki model for ores in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306881 (https://phabricator.wikimedia.org/T143567) (owner: 10Ladsgroup) [08:32:27] also dcausse I am very willing to give reviews about anything regarding the db [08:32:38] I am never too busy for that [08:32:43] jynus: understood [08:33:19] mw127[2345] updated (httpd) and working fine [08:34:05] I have nothing against the idea, I just think I can help making it better and faster :-) [08:34:18] elukey: that is for the infamous AH_xx errors isn't it ? 
hashar: yep :) [08:36:11] so all the new jessie-based appservers have the new patched httpd 2.4.10 [08:36:24] the immediate changes would be to "$dbr = wfGetDB( DB_SLAVE, 'vslow' );" and once we finish the schema change, fix the query plan by regenerating table stats [08:36:52] !log installing libcrypt security updates [08:36:54] that is way better than adding a USE INDEX [08:37:25] hashar: now that you mention it, I'd need to update the phab task :) [08:37:34] elukey: really thanks for dealing with those AH errors they are quite annoying :D Also remember some of them are entirely filtered by logstash, might want to remove those filters when the issue is known to be fixed [08:37:39] (03CR) 10Ladsgroup: [C: 031] "I cherry-picked it in beta cluster and works just fine" [puppet] - 10https://gerrit.wikimedia.org/r/306839 (https://phabricator.wikimedia.org/T143567) (owner: 10Ladsgroup) [08:37:40] for me, I probably should add a watchdog to the new servers and improve my monitoring [08:37:58] elukey: and if you feel brave, you could update the apaches on the beta cluster as well: deployment-mediawiki* instances [08:39:01] If we still need an index hint, we must document it [08:39:41] because that is a fix now and a breakage tomorrow (in this case, literally, as the schema change was ongoing as we speak) [08:40:42] hashar: yes I am planning to remove the logstash filters when all the appservers run with 2.4.10. The ubuntu version that we are running is quite old (2.4.7) and we decided not to patch/update it. 
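The index-hint debate above (USE INDEX as a quick fix vs. fixing the plan by regenerating table stats) can be illustrated with a small runnable sketch. This is not the real CirrusSearch query or schema: the table and index names here are made up, and SQLite's `INDEXED BY` clause stands in for MySQL's `USE INDEX` so the example is self-contained.

```python
import sqlite3

# Illustrative schema only; not the production link tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
    CREATE INDEX idx_from   ON pagelinks (pl_from);
    CREATE INDEX idx_target ON pagelinks (pl_namespace, pl_title);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's rough analog of MySQL's EXPLAIN.
    return " ".join(str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without a hint, the optimizer chooses the access path from its statistics;
# stale stats are why the chat prefers regenerating them (ANALYZE) over hints.
default_plan = plan(
    "SELECT COUNT(*) FROM pagelinks WHERE pl_namespace = 0 AND pl_title = 'X'")

# INDEXED BY pins the index explicitly, like USE INDEX in MySQL: a quick fix
# that silently breaks later if the index is renamed, dropped, or reordered.
hinted_plan = plan(
    "SELECT COUNT(*) FROM pagelinks INDEXED BY idx_target "
    "WHERE pl_namespace = 0 AND pl_title = 'X'")

print(hinted_plan)  # the pinned index name appears in the plan
```

Running `ANALYZE` after a schema change (the analog of the "regenerating table stats" suggestion in the chat) usually makes the hint unnecessary, which is why hard-coding the index choice is treated here as a fix now and a breakage tomorrow.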
I can definitely update the beta cluster (if not ubuntu) but I'd need some help with the hostnames [08:42:22] oh, logbot down [08:42:43] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add ca and ca_server settings to frontend [puppet] - 10https://gerrit.wikimedia.org/r/306827 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [08:43:17] there is also the fact that (assuming this is not code run interactively) sometimes it is better to run it slower, in smaller queries, rather than as one large, complex one [08:43:39] Amir1, tell me [08:44:00] jynus: okay, we are deploying a new version of ores on Monday [08:44:12] elukey: deployment-mediawiki01.deployment-prep.eqiad.wmflabs deployment-mediawiki02.deployment-prep.eqiad.wmflabs and deployment-mediawiki03.deployment-prep.eqiad.wmflabs [08:44:16] it will invalidate almost everything in the ores_classification table [08:44:30] elukey: there is also a saltmaster on deployment-salt02.deployment-prep.eqiad.wmflabs [08:44:31] invalidate(?), please link code [08:44:35] because with new models old scores will be useless [08:44:53] it is ok, if it has to be done, it has to be done [08:45:07] just make sure you do not do a DELETE FROM table; [08:45:14] let https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/includes/Cache.php#L77 [08:45:19] thank you [08:45:21] elukey: with logstash being publicly available from https://logstash-beta.wmflabs.org/ with an "apache" dashboard that shows at least AH01067 right now [08:45:57] the good news is per this way of thinking, ores tables won't grow much [08:45:59] Amir1, at first glance looks good, let me see the logic more in depth [08:46:17] because we change models from time to time [08:46:23] then we clean up the table [08:46:30] hashar: thanks but they are ubuntus :( [08:46:38] we'd need to upgrade them to jessie [08:46:56] Amir1, I assume you will run the population afterwards, too? [08:46:58] have the mw app servers in production moved to Jessie?? 
<_joe_> elukey: that's a good start [08:47:12] yeah, but it will add around 5K-10K, not more [08:47:13] <_joe_> thanks for taking care of that [08:47:21] <_joe_> (upgrading the beta mediawikis) [08:47:24] <_joe_> :D [08:47:43] The idea, Amir, is not to use DDL, e.g. TRUNCATE [08:47:56] and make transactions no longer than 1 second [08:47:59] (03PS2) 10Giuseppe Lavagetto: puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) [08:48:02] _joe_ :D [08:48:07] the bad news is that these tables are already really big, I'm afraid it might cause lag [08:48:34] It deletes in batches of 1K rows [08:49:22] no, if you follow the above recommendations, it will not [08:49:25] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [08:49:28] hashar: yes, we're partly on jessie with the app servers [08:49:32] (The good part was talking about storage issues it might cause) [08:49:59] note that deleting rows will not free disk space, that is a common misconception [08:50:20] jynus: can you elaborate on your recommendations more. 
Just to be sure [08:50:34] oh, my bad [08:50:36] yes, you applied all of it, from what I can see [08:50:48] read the ids [08:51:07] delete them in small batches (1000 is a good size) [08:51:11] then wait for slaves [08:51:41] if you want to be 100% sure, we can try on a smaller wiki first [08:51:48] check the queries [08:51:54] and then do it for larger ones [08:51:58] we do them altogether [08:52:05] in the window [08:52:13] but surely we can start with smaller ones [08:52:24] 06Operations, 06Performance-Team, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585291 (10elukey) Summary and news: 1) Thanks to @Joe a... [08:53:04] I check queries and lags [08:53:33] I assume 'oresc_rev' => $ids expands to an IN? [08:54:23] (03PS3) 10Giuseppe Lavagetto: puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) [08:54:25] I do not think 1000 ids could take more than 32MB, which would be the query size limit [08:54:56] jynus: yes [08:54:58] or does it expand it to individual DELETEs? [08:55:06] in any case, looks good to me [08:55:26] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [08:55:43] are there tables on enwiki already, Amir1? 
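The cleanup recipe recommended above (read the ids, delete them in small batches, wait for the slaves between batches, with the `'oresc_rev' => $ids` array expanding to an IN list) can be sketched roughly as follows. A minimal sketch only: SQLite stands in for MariaDB, the replication-lag check is a stub, and the batch size is lowered from the suggested 1000 to stay under SQLite's bound-parameter limit on older builds.

```python
import sqlite3

BATCH = 500  # the chat suggests 1000; smaller here for SQLite's variable limit

def wait_for_slaves():
    # Stand-in: in production this would poll replica lag (e.g. pt-heartbeat
    # or SHOW SLAVE STATUS) and sleep until it drops back down.
    pass

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ores_classification "
             "(oresc_id INTEGER PRIMARY KEY, oresc_rev INTEGER)")
conn.executemany("INSERT INTO ores_classification (oresc_rev) VALUES (?)",
                 [(i,) for i in range(2500)])

# 1) read the ids of the rows to invalidate (here: everything)
ids = [r[0] for r in conn.execute("SELECT oresc_id FROM ores_classification")]

# 2) delete them in small batches, never one huge DELETE FROM / TRUNCATE,
#    so each transaction stays short and replication can keep up
for start in range(0, len(ids), BATCH):
    chunk = ids[start:start + BATCH]
    placeholders = ",".join("?" * len(chunk))  # expands to an IN (...) list
    conn.execute(f"DELETE FROM ores_classification "
                 f"WHERE oresc_id IN ({placeholders})", chunk)
    conn.commit()
    wait_for_slaves()  # 3) wait for slaves between batches

remaining = conn.execute(
    "SELECT COUNT(*) FROM ores_classification").fetchone()[0]
print(remaining)  # 0
```

Keeping each transaction short is the point: one giant DELETE holds locks and generates one huge replication event, which is exactly what causes the lag worried about above.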
[08:56:01] yes, we deployed several days ago [08:56:11] so it's not super big for enwiki [08:56:21] but wikidata, it'll be huge [08:57:31] also it seems we have another schema change [08:57:34] (03PS1) 10Filippo Giunchedi: puppet_compiler: add script to update facts [puppet] - 10https://gerrit.wikimedia.org/r/306890 [08:57:35] https://gerrit.wikimedia.org/r/#/c/306854 [08:57:41] https://phabricator.wikimedia.org/T143962 [08:57:41] let's see wikidata [08:58:05] BTW, you are doing really great on table normalization [08:58:13] separating the model in a separate table [08:58:40] hashar: how painful would it be to upgrade deployment-mediawiki to jessie? [08:58:45] I mean all of them [08:59:44] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2585295 (10fgiunchedi) looks like the error followed the swap, now sdb is reported with failed commands ``` [ 24.549309] sd 0:1:0:1: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driver... 
[09:00:09] It was hoo's idea :) [09:01:22] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 244 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [09:01:48] not sure if oresc_is_predicted will be very helpful, but all depends on the queries done [09:02:31] it's not yet but when we include article quality models, it will be [09:02:43] and detox [09:02:56] I am ok with that, although aaron lately has been objecting to some unique indexes [09:03:04] I am ok as it is [09:03:17] (again, without looking at the queries) [09:03:33] just add a schema change ticket when it is deployed [09:03:44] I should talk to him to see that :) [09:03:45] sure [09:03:46] it should not take much time to deploy [09:03:59] index changes are very easy [09:04:10] nah, I'm waiting for the parent patch to get fixed and then I +2 both [09:04:35] and in one week it's in the cluster [09:06:19] (03PS3) 10Muehlenhoff: openldap_labs: Limit to production networks and labs networks [puppet] - 10https://gerrit.wikimedia.org/r/303181 [09:06:34] (03PS1) 10Racodond: WIP - Install SonarQube server [puppet] - 10https://gerrit.wikimedia.org/r/306892 [09:06:44] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: Puppet has 1 failures [09:07:03] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [09:07:37] (03CR) 10jenkins-bot: [V: 04-1] WIP - Install SonarQube server [puppet] - 10https://gerrit.wikimedia.org/r/306892 (owner: 10Racodond) [09:10:05] (03CR) 10Filippo Giunchedi: [C: 031] Puppetize static configuration for prometheus-mysqld-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306675 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:10:27] (03PS2) 10Racodond: WIP - Install SonarQube server [puppet] - 10https://gerrit.wikimedia.org/r/306892 [09:15:22] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are 
below the limits. [09:16:33] so dcausse, let's discuss how to move forward on T143932, I do not oppose the feature, it just needs some tweaks [09:16:34] T143932: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932 [09:17:07] jynus: sure, revert is in progress, it took time sorry it was not trivial [09:17:19] of course [09:17:49] (03PS1) 10Ema: Upgrade cp4006 (ulsfo cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306893 (https://phabricator.wikimedia.org/T131502) [09:18:09] I think the main benefit for us using mysql is a big cleanup in the code [09:18:30] I wonder, because things like flags-filtering-like [09:18:44] usually are way more efficient on elastic/search engines [09:18:57] I assume it is due to the range? [09:19:53] hashar, I think in the past schema changes were put on Deployments [09:20:34] the problem right now with that is that there is a schema change ongoing almost 100% of the time (as it can be done without affecting writes) [09:22:00] jynus: well it is not a problem per se [09:22:23] we can get code deployed and schema changes ongoing as long as we make sure they do not conflict with each other [09:22:35] I think I'm ok continuing to use elastic if we think it's hard for mysql to optimize these big OR/AND queries [09:23:03] well, you (not necessarily you or your team) should be aware of those precisely in case there is an incompatibility [09:23:08] (03PS4) 10Giuseppe Lavagetto: puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) [09:23:17] (03PS1) 10Muehlenhoff: Remove $ALL_NETWORKS ferm definition [puppet] - 10https://gerrit.wikimedia.org/r/306894 [09:23:26] right now I centralize that, I wonder if that can cause issues?
[09:23:36] jynus: yes absolutely, my bad [09:23:40] as I am not normally aware of deployments [09:23:48] dcausse, I think we can do that [09:24:04] I think it only needs some tweaks [09:24:08] from my ITIL memories, that would be handled when preparing the deployment/operations calendar and checking whether two tasks might clash [09:24:09] hashar: mind reviewing https://gerrit.wikimedia.org/r/#/c/306883/ ? [09:24:12] and it was not your bad [09:24:17] :-) [09:24:17] (03PS3) 10Racodond: WIP - Install SonarQube server [puppet] - 10https://gerrit.wikimedia.org/r/306892 [09:24:36] gehel: some sonarqube going on https://gerrit.wikimedia.org/r/306892 ! :D [09:24:38] but first, I want to dinish the ongoing schema change [09:24:43] *finish [09:25:01] hashar: yep, David is in my kitchen... [09:25:08] then we try with some suggestions [09:25:21] jynus: makes sense [09:25:58] dcausse: so that is reverting the change that required the index to be mentioned isn't it ? [09:26:21] in particular, I am interested in https://phabricator.wikimedia.org/T143911 [09:26:22] hashar: yes, the revert was not trivial because we moved to short array syntax in the meantime :/ [09:26:29] if it is a revert I guess it is all fine. Might want to check on beta that it goes fine [09:26:33] then we can sneak deploy it [09:26:36] as it would have minimized the latest 2 issues [09:26:43] hashar: +1 [09:26:48] then I am wondering whether the hook will still work properly [09:27:07] but I guess that can be tested on beta (hopefully) [09:27:20] hashar: which hook? [09:27:25] let's go as slow as necessary, there are no ongoing issues [09:27:31] https://gerrit.wikimedia.org/r/#/c/306883/4/CirrusSearch.php [09:27:57] and if there is no urgency, might want to wait for E B to review.
Cause really I have no idea about the patch impact [09:28:08] +1 [09:28:13] ok [09:28:14] this is not an emergency [09:28:33] so if the DB is fine and search works all fine [09:28:41] at least not now [09:29:02] lets keep it this way and then either: (a) get the revert patch reviewed by EB (b) wait for schema change to complete and tweak code as needed which I guess is removing the force index [09:29:29] at least we have a revert patch ready if needed! [09:29:40] sure, let me know [09:30:31] well, even if it stays, the query is not very pretty [09:30:52] (03CR) 10Ema: [C: 032] Upgrade cp4006 (ulsfo cache_upload) to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/306893 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [09:30:57] and I am not referring to how it looks, but performance-wise [09:31:40] my suggestion, to be safe this doesn't happen during the weekend would be a smaller patch [09:32:02] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [09:32:03] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:32:14] (03PS5) 10Giuseppe Lavagetto: puppetmaster: split backend and frontend vhosts [puppet] - 10https://gerrit.wikimedia.org/r/306828 (https://phabricator.wikimedia.org/T143869) [09:32:22] !log upgrading kernels on labtestmetal2001/labtestneutron2001/labtestvirt2001 [09:33:42] <_joe_> gosh jenkins [09:36:33] dcausse, hashar https://gerrit.wikimedia.org/r/306896 much easier and will minimize potential production issues [09:37:05] I have this puppet patch that only affects the beta cluster and I cherry-picked it there and tested and works just fine. 
[09:37:06] https://gerrit.wikimedia.org/r/#/c/306839/ [09:37:28] It would be great if someone reviews it, thanks [09:37:56] jynus: side note, I am wondering whether all wfGetDB() should default to pass the method name so one can easily reroute traffic as needed [09:37:59] but I digress [09:38:20] hashar, there is some QA work there indeed [09:38:32] but I would not know how to handle [09:38:51] like new code going first to a small number of servers [09:39:05] and then it "graduates" to going to all servers [09:39:08] :-) [09:39:26] well for now it is master branches going to beta cluster. But that one is not carefully looked at [09:39:37] then we do the roll wikis per wikis, but that is indeed on all servers [09:39:40] beta is not valid for this issue [09:39:58] a recent change has been to first push the code / switch the version on a dozen of app server. Then query logstash for potential spike of errors [09:40:01] because that query (all queries) will take 0 second on a db the size of beta [09:40:05] if there are any error, autorollback [09:40:08] even with a large db [09:40:10] else proceed to the rest of the fleet [09:40:21] hashar: you can canary deploy on jobqueue servers? [09:40:33] this code only runs on jobrunners [09:40:43] yeah [09:40:48] we can cherry pick the patch to wmf.16 [09:40:56] pull it on the deployment server tin.eqiad.wmnet [09:41:06] then on the canary jobqueue servers run "scap pull" [09:41:18] I will definitely try to work on https://phabricator.wikimedia.org/T143911, but that will not be easy in practice [09:41:37] ok, waiting for cindy to +1 then I'l +2 [09:41:42] neat [09:42:22] (I am wondering why folks seek to improve cross team collaboration. IRC works well!) [09:42:31] :) [09:42:46] maybe hashar we could have the canary servers to override load balancer code? 
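The canary flow hashar describes above (push to a dozen servers first, query logstash for an error spike, auto-rollback on errors, otherwise proceed to the rest of the fleet) can be sketched roughly like this. Every function name here is a hypothetical stand-in, not a scap internal:

```python
def canary_rollout(all_servers, canary_count, deploy, error_rate, rollback,
                   baseline, threshold=2.0):
    """Deploy to a small canary set first; if the post-deploy error rate
    spikes past `threshold` times the baseline, roll the canaries back;
    otherwise continue to the rest of the fleet."""
    canaries = all_servers[:canary_count]
    for host in canaries:
        deploy(host)  # e.g. running "scap pull" on a job runner
    if error_rate(canaries) > threshold * baseline:
        for host in canaries:
            rollback(host)
        return False
    for host in all_servers[canary_count:]:
        deploy(host)
    return True

# Toy run: the simulated error rate stays flat, so the rollout proceeds.
deployed = []
ok = canary_rollout(
    ["mw1161", "mw1162", "mw1163", "mw1164"],
    canary_count=2,
    deploy=deployed.append,
    error_rate=lambda hosts: 0.1,   # pretend logstash query result
    rollback=deployed.remove,
    baseline=0.1,
)
print(ok, deployed)  # True ['mw1161', 'mw1162', 'mw1163', 'mw1164']
```

The value of the canary step is that one job runner sees real production traffic, so a bad query plan or a fatal shows up in logs before the whole fleet is running the new code.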
[09:43:01] but again, not sure that would be useful [09:43:05] no idea [09:43:15] on server has very little traffic [09:43:18] *one [09:43:19] but yeah theoretically one can hack the LB code directly on one of the job servers [09:43:29] no clue about all the impacts / havoc that might cause though [09:43:37] oh, for my ticket, no [09:43:48] that must be done properly, manually changing the code [09:44:02] as some jobs already use recentchanges slaves [09:44:10] and others [09:45:22] dcausse, is it ok for that to run slower during the weekend? [09:45:35] because having only one server will definitely do that [09:46:12] (one per shard) [09:46:33] jynus: if we are slow on this one we could overload the queue I suppose :/ [09:46:39] but hard to tell :( [09:47:03] well, not 1 hour like the original query [09:47:18] but 10 -> 30 ms or something like that [09:47:42] +20 ms is not an issue imho [09:48:20] I'll have to monitor the jobqueue [09:48:30] but sounds good to me [09:49:08] let me give you exact numbers [09:50:42] Before: 0.00 seconds After: 0.01 seconds [09:55:45] (03PS2) 10Giuseppe Lavagetto: puppetmaster: move vhost from passenger class [puppet] - 10https://gerrit.wikimedia.org/r/306829 (https://phabricator.wikimedia.org/T143869) [09:56:23] !log cp4006 upgraded to varnish4 and repooled (T131502) [09:56:25] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [09:56:26] sounds fine, cirrus jobs that could be affected run in ~370ms now [09:58:25] if you need +1 or assistance to deploy let me know [09:58:35] (busy writing some python but can interrupt any time) [09:59:10] jynus: I'm ok to deploy your fix, I'll monitor the jobqueue usage for any odd behaviors [09:59:48] that is only temporary [10:00:16] sure [10:00:23] vslow is not a good group when dumps start [10:01:08] so we should migrate to 'jobq' (when we create it) and deploy some fixes by then [10:01:30] the schema change should finish next week [10:01:50] ok [10:02:15] the USE
are only short-term fixes and will break in the long-term [10:02:40] I will help trying to avoid it next week [10:02:51] thanks [10:03:01] no, thanks to you [10:04:37] (03PS5) 10Jcrespo: Puppetize static configuration for prometheus-mysqld-exporter [puppet] - 10https://gerrit.wikimedia.org/r/306675 (https://phabricator.wikimedia.org/T126757) [10:05:09] (03CR) 10Jcrespo: "> Could be removed I guess once eqiad is in place too." [puppet] - 10https://gerrit.wikimedia.org/r/306675 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:07:30] (03PS3) 10Giuseppe Lavagetto: puppetmaster: move vhost from passenger class [puppet] - 10https://gerrit.wikimedia.org/r/306829 (https://phabricator.wikimedia.org/T143869) [10:07:36] (03CR) 10Jcrespo: [C: 032] "Only after 296596 is refactored." [puppet] - 10https://gerrit.wikimedia.org/r/306675 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:08:04] (03CR) 10Jcrespo: "s/after/until/." [puppet] - 10https://gerrit.wikimedia.org/r/306675 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [10:10:28] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: move vhost from passenger class [puppet] - 10https://gerrit.wikimedia.org/r/306829 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [10:10:36] (03PS4) 10Giuseppe Lavagetto: puppetmaster: move vhost from passenger class [puppet] - 10https://gerrit.wikimedia.org/r/306829 (https://phabricator.wikimedia.org/T143869) [10:10:57] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster: move vhost from passenger class [puppet] - 10https://gerrit.wikimedia.org/r/306829 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [10:14:53] PROBLEM - Check size of conntrack table on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:15:02] PROBLEM - dhclient process on rhodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:15:22] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:53] PROBLEM - puppet last run on prometheus1002 is CRITICAL: CRITICAL: Puppet has 5 failures [10:16:22] PROBLEM - puppetmaster backend https on rhodium is CRITICAL: Connection refused [10:16:52] RECOVERY - Check size of conntrack table on rhodium is OK: OK: nf_conntrack is 0 % full [10:16:53] RECOVERY - dhclient process on rhodium is OK: PROCS OK: 0 processes with command name dhclient [10:17:24] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 354 bytes in 0.157 second response time [10:17:53] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Puppet has 3 failures [10:18:04] (03Draft2) 10Paladox: Add git.legoktm.com to system.gitconfig.erb for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [10:18:09] (03Draft1) 10Paladox: Add git.legoktm.com to system.gitconfig.erb for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/306900 (https://phabricator.wikimedia.org/T143969) [10:18:12] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [10:18:23] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.081 second response time [10:18:32] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: puppet fail [10:19:22] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: puppet fail [10:20:02] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: puppet fail [10:20:02] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: puppet fail [10:20:02] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: puppet fail [10:20:02] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [10:20:03] PROBLEM - puppet last run on auth1001 is CRITICAL: CRITICAL: puppet fail [10:20:03] PROBLEM - puppet last run on fluorine is 
CRITICAL: CRITICAL: puppet fail [10:20:03] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: puppet fail [10:20:14] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail [10:20:14] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: puppet fail [10:20:14] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: puppet fail [10:20:14] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: puppet fail [10:20:14] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail [10:20:15] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: puppet fail [10:20:15] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: puppet fail [10:20:15] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: puppet fail [10:20:22] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [10:20:22] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: puppet fail [10:20:22] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: puppet fail [10:20:22] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail [10:20:23] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: puppet fail [10:20:23] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 26 failures [10:20:23] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: puppet fail [10:20:23] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: puppet fail [10:20:23] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: puppet fail [10:20:24] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail [10:20:24] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [10:20:32] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [10:20:32] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [10:20:32] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Puppet has 8 failures [10:20:32] PROBLEM - 
puppet last run on db1088 is CRITICAL: CRITICAL: puppet fail [10:20:32] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 34 failures [10:20:33] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: puppet fail [10:20:33] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: puppet fail [10:20:33] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: puppet fail [10:20:33] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail [10:20:34] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Puppet has 32 failures [10:20:34] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: puppet fail [10:20:46] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: puppet fail [10:20:46] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: puppet fail [10:20:47] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 28 failures [10:20:47] PROBLEM - puppet last run on sinistra is CRITICAL: CRITICAL: puppet fail [10:20:48] PROBLEM - puppet last run on mw2220 is CRITICAL: CRITICAL: puppet fail [10:20:48] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: puppet fail [10:20:49] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: puppet fail [10:20:49] PROBLEM - puppet last run on mw2240 is CRITICAL: CRITICAL: puppet fail [10:20:50] PROBLEM - puppet last run on wtp2001 is CRITICAL: CRITICAL: puppet fail [10:20:50] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: puppet fail [10:20:51] PROBLEM - puppet last run on pybal-test2001 is CRITICAL: CRITICAL: Puppet has 24 failures [10:20:52] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail [10:20:52] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [10:20:52] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 40 failures [10:20:53] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: puppet fail [10:20:53] PROBLEM - puppet last run 
on cp3033 is CRITICAL: CRITICAL: Puppet has 23 failures [10:20:54] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [10:20:54] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [10:21:03] <_joe_> this is all expected ^^ [10:21:12] PROBLEM - puppet last run on mw1263 is CRITICAL: CRITICAL: puppet fail [10:21:12] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: puppet fail [10:21:12] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Puppet has 33 failures [10:21:12] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [10:21:12] PROBLEM - puppet last run on elastic2019 is CRITICAL: CRITICAL: Puppet has 17 failures [10:21:12] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: puppet fail [10:21:12] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: puppet fail [10:21:13] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Puppet has 20 failures [10:21:13] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail [10:21:14] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: puppet fail [10:21:14] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 19 failures [10:21:15] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail [10:22:19] hashar: do you have time to help on deploying https://gerrit.wikimedia.org/r/#/c/306899/ ? 
[10:22:37] dcausse: yes [10:22:39] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Puppet has 30 failures [10:22:39] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Puppet has 30 failures [10:22:39] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Puppet has 48 failures [10:22:39] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Puppet has 28 failures [10:22:39] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 43 failures [10:22:42] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 36 failures [10:22:43] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: Puppet has 24 failures [10:22:43] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 24 failures [10:22:43] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: Puppet has 31 failures [10:22:43] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 24 failures [10:22:43] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Puppet has 26 failures [10:22:43] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 53 failures [10:22:44] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 25 failures [10:22:44] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Puppet has 29 failures [10:22:45] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Puppet has 28 failures [10:22:45] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Puppet has 24 failures [10:23:02] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 30 failures [10:23:03] dcausse: +2 ed [10:23:13] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 24 failures [10:23:14] ok [10:23:22] PROBLEM - puppet last run on mw2164 is CRITICAL: CRITICAL: Puppet has 26 failures [10:23:23] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Puppet has 24 failures [10:23:23] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: 
Puppet has 19 failures [10:23:23] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: Puppet has 28 failures [10:23:24] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 32 failures [10:23:33] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: Puppet has 24 failures [10:23:33] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 25 failures [10:23:33] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Puppet has 26 failures [10:23:33] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Puppet has 24 failures [10:23:34] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: Puppet has 25 failures [10:23:53] PROBLEM - DPKG on mw2087 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:25:14] hashar: we could deploy everywhere, I'm monitoring cirrus job queues, we can revert if it causes unexpected spikes, does that sound like a good to you? [10:25:23] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:25:30] s/good/good plan/ [10:26:03] RECOVERY - DPKG on mw2087 is OK: All packages OK [10:28:30] dcausse: gotta find some canary job servers [10:28:44] ok [10:28:58] (03PS4) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:29:10] I have no idea where those jobs are being run [10:29:16] nor how the jobs are sharded [10:29:22] (03PS5) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:30:20] from puppet [10:30:27] # mw1161-1169 are job runners [10:30:37] #mw2080-mw2085 are jobrunners [10:31:10] and there are few more [10:31:53] hashar: from logs on fluorine: mw1161, mw1301, mw1304 and certainly others [10:32:00] (03PS6) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 
(https://phabricator.wikimedia.org/T75997) [10:32:39] (03PS7) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:32:46] I am pulling the patch on tin [10:32:54] ok [10:33:17] (03CR) 10Paladox: "@Dzahn this is now ready, it fixes the expanding link for good now :)" [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [10:33:17] probably the jobs are not sharded, they just execute things as they go [10:33:28] just deploy to a single server first [10:33:56] then to all (my EUR 0.02) [10:34:06] Submodule path 'extensions/CirrusSearch': rebased into '8eb1c0ca6115079fb67a6bd9b04e50ecd837187d' [10:34:19] patch is on tin [10:34:47] I can check queries on db at the same time [10:34:54] !log pulled https://gerrit.wikimedia.org/r/#/c/306899/ "Temporarilly redirect RedirectsAndIncomingLinks job to a single db" on tin for T143932 [10:34:55] T143932: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932 [10:35:12] going to scap pull on mw1161 first [10:35:17] ok [10:35:28] then I am not sure how many jobs there are [10:35:28] note that it will not be a single db, but 1 per shard [10:35:41] done on mw1161 [10:35:48] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Wonderful. Looks like I was on a wrong track for a while when I suggested to exclude "<".
This disabled the regex for ticket numbers at th" [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [10:35:52] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:35:59] not sure that one is processing many jobs though :D [10:36:24] I can now wait to identify one of those jobs to the new db [10:36:35] let me see [10:36:36] doing mw1162 as well [10:37:13] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:37:29] hashar: sounds good to me : 2016-08-26 10:36:57 [33ff7985ce1cc13874395b90] mw1161 frwikinews 1.28.0-wmf.16 runJobs INFO: cirrusSearchLinksUpdatePrioritized [10:37:44] RECOVERY - puppet last run on restbase1015 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [10:38:00] oh [10:39:03] I can confirm it's sending things to db1072 (vslow) [10:39:03] RECOVERY - puppet last run on mw2224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:03] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:39:05] I thought hhvm might have to be restarted, but I guess it is smart enough to catch the code update [10:39:12] the first jobs may be a bit slow [10:39:13] great jynus [10:39:24] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:39:24] lets warm it up a bit [10:39:30] then deploy to the fleet? [10:39:34] PROBLEM - puppet last run on prometheus1001 is CRITICAL: CRITICAL: Puppet has 5 failures [10:39:40] or should I update a few more mwapp servers ?
[10:39:44] but that is only until until the buffer pool is warmed [10:40:03] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:40:03] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:40:08] let's wait, as I said, we are not in a hurry [10:40:13] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:40:13] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:40:23] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:40:34] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:40:44] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:40:54] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:41:14] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:41:25] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:41:25] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:33] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:41:42] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:41:43] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:52] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently 
enabled, last run 9 seconds ago with 0 failures [10:41:52] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:41:52] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:41:52] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:53] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:53] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:54] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:41:54] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:41:59] !log ran scap pull on job runners mw1161 and mw1162 [10:42:02] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:42:02] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:42:02] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:03] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:42:03] RECOVERY - puppet last run on mw1265 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:03] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:42:04] RECOVERY - puppet last run on mc2012 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:42:04] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 14 
seconds ago with 0 failures [10:42:04] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:05] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:42:05] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:42:12] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:42:12] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:13] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:14] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:42:23] RECOVERY - puppet last run on elastic1017 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:42:24] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:32] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:42:32] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:42:32] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:33] RECOVERY - puppet last run on elastic2019 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:42:33] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:42:33] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:42:34] RECOVERY - 
puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:42:34] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:42:34] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:42:42] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:42:42] RECOVERY - puppet last run on titanium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:42:42] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:43] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:42:43] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:43] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:44] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:44] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:44] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:42:52] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:52] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:53] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:42:53] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last 
run 43 seconds ago with 0 failures [10:42:53] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:42:53] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:42:53] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [10:42:54] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:54] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:42:55] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:04] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:43:04] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:04] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:43:04] RECOVERY - puppet last run on labvirt1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:04] RECOVERY - puppet last run on mw1280 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:12] RECOVERY - puppet last run on elastic2002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:43:13] RECOVERY - puppet last run on mw2225 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:43:13] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:43:13] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:43:13] RECOVERY - puppet 
last run on db1044 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:43:13] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:43:13] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:14] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:43:14] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:43:15] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:43:15] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:43:16] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:43:23] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:24] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:24] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:32] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:43:32] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:32] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:33] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:43:33] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 5 
seconds ago with 0 failures [10:43:33] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:43:42] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:43:43] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:43:43] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:43] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:43] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:43] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:43:43] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:43:52] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:43:53] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:44:02] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on db1088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on mw2170 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:03] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:04] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:04] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:05] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:44:05] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:44:13] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:44:13] RECOVERY - puppet last run on mw2220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:13] RECOVERY - puppet last run on pybal-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:13] RECOVERY - puppet last run on mw2240 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:44:13] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:13] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:44:14] RECOVERY - puppet last run on analytics1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:14] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:44:23] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:44:23] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures 
[10:44:23] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:44:23] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:24] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:24] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:44:33] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:33] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:42] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:44:43] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:44:52] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:44:52] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:44:53] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:53] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:53] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:53] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:53] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:54] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [10:44:54] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:55] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:44:55] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:44:56] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:45:02] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [10:45:03] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:45:03] RECOVERY - puppet last run on mw2064 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:45:03] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:04] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:04] RECOVERY - puppet last run on rdb2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:15] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:22] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:22] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:45:22] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:23] RECOVERY - puppet last run on ganeti2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:45:23] 
RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:45:23] RECOVERY - puppet last run on restbase-test2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:45:23] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:45:24] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [10:45:32] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:32] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:45:33] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:33] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:45:33] RECOVERY - puppet last run on db2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:33] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:33] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:45:34] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:43] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:43] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:43] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:44] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently 
enabled, last run 49 seconds ago with 0 failures [10:45:44] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:44] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:44] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:45:45] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [10:45:45] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:45:46] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:45:46] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:56] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:45:56] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:02] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:46:02] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:46:03] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:46:03] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:46:13] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:46:13] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:46:13] RECOVERY - 
puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [10:46:14] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:46:14] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:46:14] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:46:14] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:46:15] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:46:15] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:22] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:22] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:23] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:46:23] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:46:23] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:46:23] RECOVERY - puppet last run on sinistra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:23] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:24] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:24] RECOVERY - puppet last run on wtp2001 is OK: OK: Puppet is currently enabled, last run 2 
minutes ago with 0 failures [10:46:32] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:33] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:33] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:33] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:34] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:46:43] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:43] RECOVERY - puppet last run on ganeti1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:46:43] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:46:44] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:46:47] * hashar whistles [10:46:52] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:46:53] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:54] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:54] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:46:54] RECOVERY - puppet last run on ms-fe1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:46:54] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[10:47:02] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:47:03] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:47:03] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:47:03] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:04] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:04] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:47:04] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:47:04] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:47:10] dcausse: jynus: should I push to the rest of the servers? 
Currently it is solely mw1161 and mw1162 [10:47:12] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:47:13] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:47:13] RECOVERY - puppet last run on graphite2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:14] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:14] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:47:14] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:47:22] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:23] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:47:23] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:47:23] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:47:23] RECOVERY - puppet last run on mw2233 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:47:23] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:24] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:33] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:47:34] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:37] hashar, looks good 
to me, no long queries in the last minutes [10:47:42] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:47:42] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:43] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [10:47:43] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:44] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:44] RECOVERY - puppet last run on bast3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:46] (03PS2) 10Giuseppe Lavagetto: puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) [10:47:53] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:53] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:47:53] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:53] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:47:54] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:03] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:48:03] hashar: for me job times are ok but I'm not sure that 2 servers can cause a visible spike [10:48:03] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[10:48:05] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:48:12] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:12] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:12] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:48:14] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:22] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:23] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:48:23] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:48:23] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:23] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:23] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:24] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:28] jynus: dcausse going to pull it on a few more servers [10:48:32] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:48:34] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:42] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:48:53] 
(03CR) 10jenkins-bot: [V: 04-1] puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [10:49:03] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:04] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:49:04] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:49:04] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:06] hashar: if more concurrent queries can cause trouble then it sounds safer, yes [10:49:12] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:49:12] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:49:32] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:50:01] !log running scap pull on mw1163 mw1164 mw1165 mw1166 mw1167 mw1168 (job runners) for T143932 [10:50:09] T143932: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932 [10:50:30] (03PS1) 10Filippo Giunchedi: prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 [10:50:40] (03PS8) 10Paladox: Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) [10:50:52] synced [10:51:00] mw1161 to mw1168 have the new code [10:51:07] so we went from 2 to 8 servers [10:51:09] does anybody know what radon is? [10:51:18] the gas? [10:51:32] <_joe_> the dns master [10:51:36] <_joe_> jynus: why? 
[10:51:58] (03CR) 10Paladox: "I only brought back the space that was there before since I don't want it breaking because of me doing something that we can do in a separ" [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [10:52:00] <_joe_> the auth dns master to be exact [10:52:07] what??? [10:52:54] (03CR) 10Filippo Giunchedi: "PCC: https://puppet-compiler.wmflabs.org/3856/" [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [10:53:03] <_joe_> jynus: can you tell me what's up? [10:53:26] (03CR) 10Paladox: [C: 031] "It works :)" [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [10:53:58] ok, I thought there was an issue [10:54:11] it says radon, but the IP is a snapshot server [10:56:02] false alarm, I thought a DNS server was connecting to mysql, which made no sense [10:56:17] while I was checking queries for an unrelated reason [10:57:20] * apergos raises an eyebrow [10:57:55] no, dumps have every right to dump tables; DNS servers don't [10:57:58] :-) [10:59:44] (03PS3) 10Giuseppe Lavagetto: puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) [11:00:14] heh [11:01:08] (03CR) 10Muehlenhoff: [C: 031] "One nit (see comment, feel free to ignore), but looks good to me." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [11:01:51] (03PS1) 10Urbanecm: Add throttling exception for UBC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306903 (https://phabricator.wikimedia.org/T143951) [11:02:27] hashar, jynus: I can see an increase https://graphite.wikimedia.org/S/Bn, this is certainly expected [11:03:04] now it's hard to tell what's an acceptable increase for the job queue broker :/ [11:03:21] (03PS2) 10Filippo Giunchedi: prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 [11:03:34] (03CR) 10Filippo Giunchedi: prometheus: restrict node_exporter to $prometheus_nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [11:03:49] dcausse, what is that, latency in ms? [11:03:52] (03CR) 10Paladox: "This https://gerrit.wikimedia.org/r/#/c/306413/ should fix the problem now :)" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [11:03:55] jynus: yes [11:04:13] expect it higher at first, but it will decrease back after warmup [11:04:21] ok [11:04:42] I do not want you on a single server normally [11:05:06] https://grafana.wikimedia.org/dashboard/db/job-queue-health gives a nice overview [11:05:16] (disclaimer I did the first graph with red/green bars) [11:05:23] and also lets you select specific jobs [11:05:25] this job being slow is not an issue, I'm more concerned by the overall job queue health [11:06:07] (03PS1) 10Ladsgroup: Disable ORES for reverted, goodfaith and wp10 models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306904 (https://phabricator.wikimedia.org/T143988) [11:08:15] !log Restarting all the Java daemons on most of the analytics* hosts for security upgrades [11:09:09] dcausse: and
https://grafana.wikimedia.org/dashboard/db/job-queue-rate?from=1472198937699&to=1472209737699&var-Job=cirrusSearchIncomingLinkCount&var-Job=cirrusSearchLinksUpdate&var-Job=cirrusSearchLinksUpdatePrioritized [11:09:22] dcausse: shows you the job processing rate for the three cirrus search jobs [11:09:28] over the last three hours [11:10:10] hashar: thanks, very useful [11:10:34] 06Operations, 07LDAP, 13Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2585459 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:11:28] (03CR) 10Muehlenhoff: [C: 031] prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [11:11:32] (03PS4) 10Giuseppe Lavagetto: puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) [11:11:33] dcausse: and https://grafana.wikimedia.org/dashboard/db/job-queue-health?from=1472199085655&to=1472209885655&var-jobType=cirrusSearchIncomingLinkCount&var-jobType=cirrusSearchLinksUpdate&var-jobType=cirrusSearchLinksUpdatePrioritized [11:11:59] dcausse: over 3 hours gives you the global queue status and next to the bottom there should be three graphs showing the wait time for each of the three jobs [11:12:53] For cirrusSearchLinksUpdate : https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=17&fullscreen&from=1472199150512&to=1472209950512&var-jobType=cirrusSearchIncomingLinkCount&var-jobType=cirrusSearchLinksUpdate&var-jobType=cirrusSearchLinksUpdatePrioritized [11:13:33] (03PS1) 10Muehlenhoff: Ship a script to rewrite group memberships after enabling the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/306905 (https://phabricator.wikimedia.org/T142817) [11:14:48] hashar: thanks, do you think we can move forward?
[11:15:09] for me it sounds good [11:16:02] in the worst case we can revert and wait for Erik to merge the patch to switch back to elastic from inc link counting [11:22:05] for me as well [11:23:47] !log hashar@tin Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/BuildDocument/RedirectsAndIncomingLinks.php: Temporarilly redirect RedirectsAndIncomingLinks job to a single db T143932 (duration: 00m 49s) [11:23:50] dcausse: jynus: I have pushed the CirrusSearch patch to all app servers [11:23:59] T143932: MySQL chooses poor query plan for link counting query - https://phabricator.wikimedia.org/T143932 [11:25:36] hashar: thanks [11:26:12] all of that without even understanding what the job is meant to do :D [11:26:35] 06Operations, 13Patch-For-Review: Update firejail to 0.9.40 - https://phabricator.wikimedia.org/T121756#2585481 (10MoritzMuehlenhoff) [11:26:38] :) [11:26:58] 06Operations, 13Patch-For-Review: Update firejail to 0.9.40 - https://phabricator.wikimedia.org/T121756#1887385 (10MoritzMuehlenhoff) 05Open>03Resolved All hosts have been upgraded to firejail 0.9.40 in the mean time. [11:28:06] hashar, jynus: thanks for your help, and apologies for the mess caused by cirrus this week [11:30:42] just ping me at any time, I am here to help you [11:31:12] 06Operations, 06Performance-Team, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585498 (10hashar) Congratulations ! Might want a low pr...
[11:31:32] (03PS1) 10Jcrespo: Fix puppet issues generating empty files for mysql configuration [puppet] - 10https://gerrit.wikimedia.org/r/306906 (https://phabricator.wikimedia.org/T126757) [11:31:53] dcausse: I dont mind mess as long as folks show up with a sweeper and clean it up :] [11:33:28] (03CR) 10Jcrespo: [C: 032] Fix puppet issues generating empty files for mysql configuration [puppet] - 10https://gerrit.wikimedia.org/r/306906 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [11:35:24] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:36:45] RECOVERY - puppet last run on prometheus1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:36:53] RECOVERY - puppet last run on prometheus1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:47:02] RECOVERY - puppet last run on prometheus2002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:55:47] 10Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2585537 (10hashar) I will rebuild Zuul to try a patch for T128569 ( https://github.com/openstack-infra/zuul/commit/a8... [11:58:47] _joe_ would it be ok to create subtasks to T143536 for things like beta appservers upgrade ? 
[11:58:47] T143536: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536 [11:59:03] I am chatting with hashar about the best way to track it [11:59:31] there is also another related task to remove the logstash black list for the apache error logs [12:00:03] otherwise we could simply create new tasks and add a reference [12:02:06] (03PS1) 10Muehlenhoff: Disable unprivileged user namespaces on labvirt nodes running 4.4 HWE kernels [puppet] - 10https://gerrit.wikimedia.org/r/306910 (https://phabricator.wikimedia.org/T142567) [12:04:29] elukey: maybe copy paste joe task but for beta cluster and each kind of servers being a checklist [12:04:45] eg: Upgrade all mw* servers on beta cluster to jessie [12:04:53] [ ] app [12:04:55] [ ] jobrunner [12:04:56] etc [12:05:29] good idea, then we'll decide if it is wise to make it a subtask [12:05:59] yeah [12:06:05] hashar: where should I start to look for hostnames etc..? [12:06:17] mediawiki-config [12:06:34] has some ip and often the fqdn as a comment [12:06:46] then rest is in puppet and hopefully mostly under /hieradata/ [12:07:05] the chain on beta is similar to prod albeit without lvs [12:07:21] so we get: end users --> nginx --> varnish front --> varnish backend --> pool of app servers [12:07:31] with varnish having each of the app server in its config [12:07:44] whereas in production that is a service IP / LVS to load balance accross the servers [12:08:38] hieradata/labs/deployment-prep/common.yaml has some list under scap::dsh::groups [12:10:24] it is probably easier to create a new jessie instance, provision it with puppet [12:10:28] then get it added to the pool [12:10:32] yeah [12:11:38] afk a bit [12:16:52] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Fix phabricator expanding links [puppet] - 10https://gerrit.wikimedia.org/r/306413 (https://phabricator.wikimedia.org/T75997) (owner: 10Paladox) [12:27:26] !log dropping databases on es2001 (unused) to make space for temporary data archival 
[12:29:13] RECOVERY - Disk space on es2001 is OK: DISK OK [12:33:33] <_joe_> elukey: of course it would be [12:34:43] (03PS5) 10Giuseppe Lavagetto: puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) [12:37:22] PROBLEM - MD RAID on copper is CRITICAL: CRITICAL: Active: 3, Working: 3, Failed: 1, Spare: 0 [12:37:36] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: move vhost to role [puppet] - 10https://gerrit.wikimedia.org/r/306830 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [12:38:27] (03PS1) 10Jcrespo: Reimage db1042 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/306912 [12:42:14] (03CR) 10Jcrespo: [C: 032] Reimage db1042 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/306912 (owner: 10Jcrespo) [12:42:26] (03PS2) 10Jcrespo: Reimage db1042 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/306912 [12:49:05] (03PS2) 10Giuseppe Lavagetto: puppetmaster::frontend: add vhost for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/306831 (https://phabricator.wikimedia.org/T143869) [12:56:09] (03PS4) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [12:56:40] (03PS2) 10Faidon Liambotis: nagios: make check_sslxNN multithreaded [puppet] - 10https://gerrit.wikimedia.org/r/306866 [12:56:42] (03PS2) 10Faidon Liambotis: nagios: fix check_ssl with a newer IO::Socket::SSL [puppet] - 10https://gerrit.wikimedia.org/r/306867 [12:56:44] (03PS5) 10Faidon Liambotis: nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 [12:57:49] (03CR) 10Faidon Liambotis: [C: 032] nagios: make check_sslxNN multithreaded [puppet] - 10https://gerrit.wikimedia.org/r/306866 (owner: 10Faidon Liambotis) [12:58:03] (03CR) 10Faidon Liambotis: [C: 032] nagios: fix check_ssl with a newer IO::Socket::SSL [puppet] - 10https://gerrit.wikimedia.org/r/306867 (owner: 10Faidon 
Liambotis) [12:58:12] (03CR) 10Jcrespo: [C: 031] prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [12:58:21] (03CR) 10Faidon Liambotis: [C: 032] nagios: add no-SNI mode to check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/182306 (owner: 10Faidon Liambotis) [13:00:51] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: add vhost for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/306831 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [13:00:57] (03PS3) 10Giuseppe Lavagetto: puppetmaster::frontend: add vhost for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/306831 (https://phabricator.wikimedia.org/T143869) [13:01:43] (03CR) 10Faidon Liambotis: [C: 032] "This is great to see, thank you so much for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/306894 (owner: 10Muehlenhoff) [13:01:49] (03PS2) 10Faidon Liambotis: Remove $ALL_NETWORKS ferm definition [puppet] - 10https://gerrit.wikimedia.org/r/306894 (owner: 10Muehlenhoff) [13:02:14] (03CR) 10Faidon Liambotis: [V: 032] Remove $ALL_NETWORKS ferm definition [puppet] - 10https://gerrit.wikimedia.org/r/306894 (owner: 10Muehlenhoff) [13:02:36] <_joe_> paravoid: behave [13:02:42] what? [13:02:46] <_joe_> I was waiting jenkins like a good citizen :P [13:02:49] oh [13:02:49] haha :) [13:03:11] (03PS4) 10Giuseppe Lavagetto: puppetmaster::frontend: add vhost for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/306831 (https://phabricator.wikimedia.org/T143869) [13:03:22] <_joe_> did you merge your patches? 
[13:03:26] I did [13:03:29] <_joe_> ok [13:03:37] <_joe_> so let me break the puppetmasters for you [13:04:03] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster::frontend: add vhost for FQDN [puppet] - 10https://gerrit.wikimedia.org/r/306831 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [13:04:48] (03PS3) 10Filippo Giunchedi: prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 [13:05:15] (03Abandoned) 10Faidon Liambotis: Add a third-party cron module [puppet] - 10https://gerrit.wikimedia.org/r/62955 (owner: 10Faidon Liambotis) [13:06:29] (03CR) 10Faidon Liambotis: [C: 031] "Also thanks for working on this :)" [puppet] - 10https://gerrit.wikimedia.org/r/306905 (https://phabricator.wikimedia.org/T142817) (owner: 10Muehlenhoff) [13:07:31] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: restrict node_exporter to $prometheus_nodes [puppet] - 10https://gerrit.wikimedia.org/r/306902 (owner: 10Filippo Giunchedi) [13:07:59] (03PS2) 10Jcrespo: mariadb: install node/mysql exporters in eqiad too [puppet] - 10https://gerrit.wikimedia.org/r/306174 (https://phabricator.wikimedia.org/T126757) (owner: 10Filippo Giunchedi) [13:08:39] you guys can't wait for jenkins, your vacations are about to start [13:08:50] ;) [13:09:14] don't push me, I'll upgrade cr1-esams/cr2-knams :P [13:09:21] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_frontend: remove unnecessary require [puppet] - 10https://gerrit.wikimedia.org/r/306915 [13:09:29] (03CR) 10Filippo Giunchedi: base/monitoring: add optional SMART disk check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) (owner: 10Dzahn) [13:10:49] (03CR) 10Faidon Liambotis: [C: 031] "That sounds fine, although "Note that if the log files are removed automatically, recovery after a catastrophic failure is likely to be im" [puppet] - 10https://gerrit.wikimedia.org/r/305992 
(https://phabricator.wikimedia.org/T143302) (owner: 10Muehlenhoff) [13:12:28] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [13:12:35] <_joe_> that's me ^^ [13:12:37] (03CR) 10Jcrespo: [C: 031] "When you are ready." [puppet] - 10https://gerrit.wikimedia.org/r/306174 (https://phabricator.wikimedia.org/T126757) (owner: 10Filippo Giunchedi) [13:12:40] <_joe_> damn dependency trees [13:12:58] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2585762 (10Cmjohnson) okay, odd that it would be another disk issue but that makes the most sense. I will request a new one. [13:13:10] (03PS2) 10Giuseppe Lavagetto: puppetmaster::web_frontend: remove unnecessary require [puppet] - 10https://gerrit.wikimedia.org/r/306915 [13:13:54] (03CR) 10Muehlenhoff: "Yeah, that's all in the backups. Also with our checkpointing settings in slapd.conf we're really only needing the last BDB transaction fil" [puppet] - 10https://gerrit.wikimedia.org/r/305992 (https://phabricator.wikimedia.org/T143302) (owner: 10Muehlenhoff) [13:13:56] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::web_frontend: remove unnecessary require [puppet] - 10https://gerrit.wikimedia.org/r/306915 (owner: 10Giuseppe Lavagetto) [13:14:42] (03CR) 10Filippo Giunchedi: [C: 032] mariadb: install node/mysql exporters in eqiad too [puppet] - 10https://gerrit.wikimedia.org/r/306174 (https://phabricator.wikimedia.org/T126757) (owner: 10Filippo Giunchedi) [13:14:49] (03PS3) 10Filippo Giunchedi: mariadb: install node/mysql exporters in eqiad too [puppet] - 10https://gerrit.wikimedia.org/r/306174 (https://phabricator.wikimedia.org/T126757) [13:17:40] <_joe_> I can't connect to port 8140 on palladium [13:18:26] same here, just tried a puppet run [13:18:35] <_joe_> yeah for some reason apache died [13:18:39] <_joe_> probably my fault [13:18:41] <_joe_> let me see [13:19:02] * godog grabs umbrella for incoming puppet showers 
[13:19:10] <_joe_> yeah it's going to look bad [13:19:12] 8140 seems open on ipv6, could be it? [13:19:19] <_joe_> ferm is reloading everywhere [13:19:25] <_joe_> jynus: nope, apache died [13:19:50] <_joe_> paravoid: was ferm reloading everywhere expected? [13:20:04] yeah [13:20:06] <_joe_> let me see if this is my fault [13:20:06] yes [13:20:11] the ALL_NETWORKS definition got removed [13:20:17] <_joe_> or maybe a race condition [13:20:41] does that mean some seconds of disconnection in this case? [13:21:38] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Puppet has 33 failures [13:21:39] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Puppet has 37 failures [13:21:40] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet has 19 failures [13:21:40] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 83 failures [13:21:48] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 65 failures [13:21:48] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 92 failures [13:21:48] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 49 failures [13:22:04] (I killed it) [13:22:24] <_joe_> ok it's definitely my fault, something funny going on with certificates, meh [13:22:57] no problem [13:24:55] (03CR) 10Mobrovac: [C: 04-1] "This seems unnecessary. If a service is not using those, then them being there doesn't change anything." 
[puppet] - 10https://gerrit.wikimedia.org/r/306842 (owner: 10Ppchelko) [13:30:59] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 31 failures [13:31:01] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: puppet fail [13:31:01] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: puppet fail [13:31:01] PROBLEM - puppet last run on db1086 is CRITICAL: CRITICAL: puppet fail [13:31:02] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [13:31:02] PROBLEM - puppet last run on labsdb1008 is CRITICAL: CRITICAL: Puppet has 31 failures [13:31:02] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: puppet fail [13:31:06] shut up :P [13:37:03] jynus: (I'll do the bot :) we're getting pages [13:37:59] I know, it is just a server being reimaged [13:38:35] k, thanks [13:38:44] sorry about that [13:39:59] (03PS1) 10Jcrespo: Move all firewall setup for mariadb::core to the role [puppet] - 10https://gerrit.wikimedia.org/r/306918 [13:44:58] (03CR) 10Andrew Bogott: [C: 031] Disable unprivileged user namespaces on labvirt nodes running 4.4 HWE kernels [puppet] - 10https://gerrit.wikimedia.org/r/306910 (https://phabricator.wikimedia.org/T142567) (owner: 10Muehlenhoff) [13:47:27] (03PS2) 10Jcrespo: Move all firewall setup for mariadb::core to the role [puppet] - 10https://gerrit.wikimedia.org/r/306918 [13:54:50] (03PS1) 10Giuseppe Lavagetto: puppetmaster::frontend: raise priority of the 'puppet' vhost [puppet] - 10https://gerrit.wikimedia.org/r/306922 [13:59:02] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: raise priority of the 'puppet' vhost [puppet] - 10https://gerrit.wikimedia.org/r/306922 (owner: 10Giuseppe Lavagetto) [13:59:07] (03PS3) 10Jcrespo: mysql: Clean up puppet code related to code databases [puppet] - 10https://gerrit.wikimedia.org/r/306918 [14:00:42] (03PS4) 10Jcrespo: mysql: Clean up puppet code related to code databases [puppet] - 10https://gerrit.wikimedia.org/r/306918 [14:01:53]
(03Abandoned) 10Jcrespo: Remove ferm exceptions for iron being a DBA maintenance host [puppet] - 10https://gerrit.wikimedia.org/r/274366 (owner: 10Muehlenhoff) [14:02:25] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:03:05] <_joe_> lol it was all the precise hosts left [14:03:06] Volker_E: Would you be comfortable submitting a puppet patch to implement https://phabricator.wikimedia.org/T143465? If not I'll need to get your key some secure way. [14:03:40] <_joe_> for some reason they don't appear to support SNI, so they always get the default virtualhost [14:06:45] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:07:34] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:08:45] PROBLEM - puppet last run on ms-be2024 is CRITICAL: CRITICAL: Puppet has 1 failures [14:09:01] (03PS2) 10Giuseppe Lavagetto: puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) [14:14:26] RECOVERY - puppet last run on potassium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:14:54] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:32] (03PS3) 10Giuseppe Lavagetto: puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) [14:15:47] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:16:27] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:16:39] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster::frontend: get workers from hiera [puppet] - 
10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [14:17:06] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2569112 (10Andrew) @Volker_E, we'll need a public ssh key to do this. Either paste here or submit a puppet patch to puppet/modules/admin/data/data.yaml. [14:17:31] (03PS4) 10Giuseppe Lavagetto: puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) [14:17:33] 06Operations, 10Ops-Access-Requests: Requesting access to the statistics host(s) for flemmerich - https://phabricator.wikimedia.org/T143881#2582044 (10Andrew) @leila, do you know what group this is, or can you point to another user that has similar access? It's not immediately obvious to me what this encompas... [14:18:33] (03PS1) 10Jcrespo: prometheus: Test mysqld-exporter on s6 slaves to check load impact [puppet] - 10https://gerrit.wikimedia.org/r/306928 (https://phabricator.wikimedia.org/T126757) [14:20:14] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/306928 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:20:48] (03CR) 10Jcrespo: [C: 032] prometheus: Test mysqld-exporter on s6 slaves to check load impact [puppet] - 10https://gerrit.wikimedia.org/r/306928 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [14:21:01] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [14:21:26] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [14:21:35] (03PS5) 10Giuseppe Lavagetto: puppetmaster::frontend: 
get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) [14:22:10] (03CR) 10Faidon Liambotis: [C: 04-1] "Yeah, this would definitely need to run against all disks on a system. A hiera setting shouldn't be needed" [puppet] - 10https://gerrit.wikimedia.org/r/304580 (https://phabricator.wikimedia.org/T86552) (owner: 10Dzahn) [14:23:03] 06Operations, 06Performance-Team, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585992 (10elukey) 05Open>03Resolved [14:24:13] (03CR) 10Giuseppe Lavagetto: [V: 032] puppetmaster::frontend: get workers from hiera [puppet] - 10https://gerrit.wikimedia.org/r/306832 (https://phabricator.wikimedia.org/T143869) (owner: 10Giuseppe Lavagetto) [14:25:18] 06Operations, 10Ops-Access-Requests: Requesting access to the statistics host(s) for flemmerich - https://phabricator.wikimedia.org/T143881#2582044 (10MoritzMuehlenhoff) @flemmerich This access request misses the details of what you want to access in particular, see https://wikitech.wikimedia.org/wiki/Producti... 
[14:26:40] 06Operations, 10Wikimedia-Apache-configuration: Remove apache error log filters after all the migration - https://phabricator.wikimedia.org/T144005#2586001 (10elukey) [14:29:00] 06Operations, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2586022 (10elukey) [14:29:15] hashar: --^ [14:30:06] 06Operations, 10Wikimedia-Apache-configuration: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2586037 (10elukey) [14:32:27] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2586056 (10MoritzMuehlenhoff) @kaldari : I have built an initial package, do you have some SVGs which use these fonts for testing? [14:32:27] RECOVERY - puppet last run on ms-be2024 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:36:06] 10Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2586065 (10Paladox) @hashar we should also bump the version to 2.5.0 per https://github.com/openstack-infra/zuul/rele... [15:08:35] (03PS1) 10BBlack: text VCL: limited redirect for awful TLS negotiations [puppet] - 10https://gerrit.wikimedia.org/r/306935 [15:08:42] 06Operations, 10DBA: mysql boxes not in ganglia - https://phabricator.wikimedia.org/T87209#2586204 (10jcrespo) This will probably end up being a "Won't fix" if https://grafana.wikimedia.org/dashboard/db/mysql ends up being successful. 
[15:09:44] (03PS1) 10Jcrespo: Labsdb: include labs salt groups and prometheus monitoring for dbs [puppet] - 10https://gerrit.wikimedia.org/r/306936 (https://phabricator.wikimedia.org/T126757) [15:10:09] ^godog: this should fix labsdbs [15:12:19] (03PS2) 10BBlack: text VCL: limited redirect for awful TLS negotiations [puppet] - 10https://gerrit.wikimedia.org/r/306935 [15:13:26] <_joe_> bblack: that's pretty cool :) [15:14:09] (03PS1) 10Jcrespo: dbproxy: add prometheus node monitoring [puppet] - 10https://gerrit.wikimedia.org/r/306937 (https://phabricator.wikimedia.org/T126757) [15:14:35] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [15:17:14] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:18:02] (03PS2) 10Giuseppe Lavagetto: [WiP] puppetmaster::gitclone: support primary/secundary masters [puppet] - 10https://gerrit.wikimedia.org/r/306833 [15:22:16] jynus: neat, fails to compile on e.g.
labsdb1011 though https://puppet-compiler.wmflabs.org/3862/labsdb1011.eqiad.wmnet/change.labsdb1011.eqiad.wmnet.err [15:22:56] oh [15:23:06] I think the parameter is mysql_group or something [15:23:17] (03PS1) 10Jcrespo: es2001-4: add node exporter to this standalones hosts [puppet] - 10https://gerrit.wikimedia.org/r/306939 (https://phabricator.wikimedia.org/T126757) [15:25:10] (03PS2) 10Jcrespo: Labsdb: include labs salt groups and prometheus monitoring for dbs [puppet] - 10https://gerrit.wikimedia.org/r/306936 (https://phabricator.wikimedia.org/T126757) [15:27:41] godog, https://puppet-compiler.wmflabs.org/3863/labsdb1011.eqiad.wmnet/ [15:44:01] (03PS1) 10BryanDavis: logstash: Stop dropping mod_proxy_fcgi warnings [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) [15:50:35] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: puppet fail [15:50:39] bd808: o/ --^ would need to wait a bit since we haven't patched the ubuntu app servers :( [15:51:18] elukey: heh. but the bug was closed! [15:51:47] it might be fun to see the errors before/after too [15:52:09] they go away completely with the new jessie version [15:52:15] we just filtered them out to reduce noise on the fatalmonitor report screen in kibana [15:52:37] I opened a subtask in https://phabricator.wikimedia.org/T143536 [15:52:58] to basically apply your patch [15:53:16] so I'll link your patch to it [15:53:29] thanks!
[15:53:43] 06Operations, 10Wikimedia-Apache-configuration: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2586386 (10elukey) https://gerrit.wikimedia.org/r/306943 [15:53:50] (03PS1) 10ArielGlenn: add max_allowed_packet to xml/sql dump config so mysqldump doesn't whine [puppet] - 10https://gerrit.wikimedia.org/r/306944 [15:54:33] (03CR) 10Elukey: [C: 04-1] "We'd need just to wait for https://phabricator.wikimedia.org/T143536 because we didn't fix the issue for the ubuntu appservers :(" [puppet] - 10https://gerrit.wikimedia.org/r/306943 (https://phabricator.wikimedia.org/T73487) (owner: 10BryanDavis) [15:54:56] elukey: my thanks to you for actually caring about that error! I filtered the logs out after asking a few times and being told that it was a known issue that we should ignore [15:56:18] bd808: nothing user facing, we were sending a 304 correctly and then log 503s and horrible apache errors.. Really weird :) [15:56:20] (03PS2) 10ArielGlenn: add max_allowed_packet to xml/sql dump config so mysqldump doesn't whine [puppet] - 10https://gerrit.wikimedia.org/r/306944 [15:58:34] 06Operations, 06Performance-Team, 10Wikimedia-Apache-configuration, 07HHVM, and 2 others: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2586403 (10elukey) As FYI we'll be able to merge the above patch only... [15:59:10] (03CR) 10Jcrespo: "Yes, we used to have 16M, we had to increase it to 32M, and from there we applied it everywhere. Sorry I missed this." 
[puppet] - 10https://gerrit.wikimedia.org/r/306944 (owner: 10ArielGlenn) [15:59:16] (03CR) 10Jcrespo: [C: 031] add max_allowed_packet to xml/sql dump config so mysqldump doesn't whine [puppet] - 10https://gerrit.wikimedia.org/r/306944 (owner: 10ArielGlenn) [16:10:09] (03CR) 10ArielGlenn: [C: 032] add max_allowed_packet to xml/sql dump config so mysqldump doesn't whine [puppet] - 10https://gerrit.wikimedia.org/r/306944 (owner: 10ArielGlenn) [16:11:10] (03PS1) 10ArielGlenn: add max_allowed_packet config setting for dumps [dumps] - 10https://gerrit.wikimedia.org/r/306946 [16:11:55] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: puppet fail [16:11:59] jynus: no worries, it's all taken care of, or it will be shortly :-) [16:16:36] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:39] 06Operations, 10Traffic, 13Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2586464 (10BBlack) Following up a little further on the AES256 arguments: one possible counter-argument is that our current-best (and most popular by far) key exchange primitive... [16:22:19] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed(?) sda on ms-be1022 - https://phabricator.wikimedia.org/T140597#2586492 (10Cmjohnson) Ticket created to replace SSD Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the d... 
[16:30:34] (03CR) 10ArielGlenn: [C: 032] add max_allowed_packet config setting for dumps [dumps] - 10https://gerrit.wikimedia.org/r/306946 (owner: 10ArielGlenn) [16:35:25] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:35:57] (03CR) 10Mobrovac: [C: 04-1] "So this assumes /sys/links is in place, which means it depends on" [puppet] - 10https://gerrit.wikimedia.org/r/306857 (owner: 10Ppchelko) [16:39:09] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586527 (10AndyRussG) Should we make a task for talking to third-party users? Announce the cha... [16:46:01] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586538 (10BBlack) Monday morning's fine. IIRC from the meeting, the number of 3rd party wiki... [16:47:02] 06Operations, 10Phabricator: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2586542 (10mmodell) [16:49:31] (03PS2) 10Ppchelko: Change-Prop: Rerender summary on wikidata item update [puppet] - 10https://gerrit.wikimedia.org/r/306857 [16:52:48] (03CR) 10Mobrovac: [C: 031] Change-Prop: Rerender summary on wikidata item update [puppet] - 10https://gerrit.wikimedia.org/r/306857 (owner: 10Ppchelko) [16:53:38] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586592 (10Dereckson) @Mjohnson_WMF what are your plans to pick a name and so we can move this task forward? Past experience shows the... 
[16:54:51] (03PS15) 10Yuvipanda: Introduce 'clush' module and toollabs role [puppet] - 10https://gerrit.wikimedia.org/r/305804 [16:54:59] (03CR) 10Yuvipanda: [C: 032 V: 032] Introduce 'clush' module and toollabs role [puppet] - 10https://gerrit.wikimedia.org/r/305804 (owner: 10Yuvipanda) [16:55:24] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [16:56:15] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [16:59:01] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586632 (10AndyRussG) >>! In T143271#2586538, @BBlack wrote: > Monday morning's fine. Great,... [16:59:47] (03CR) 10Dereckson: [C: 031] "Network ranges looks good to @Darkoneko. Thanks to him for checking." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306903 (https://phabricator.wikimedia.org/T143951) (owner: 10Urbanecm) [17:00:13] (03PS8) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [17:00:59] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586639 (10Mjohnson_WMF) My preference for the URL remains https://projectcom.wikimedia.org. As an abbreviated handle, projectcom match... 
[17:01:24] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [17:01:56] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 754 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4871872 keys - replication_delay is 754 [17:03:09] (03Abandoned) 10Ppchelko: Change-Prop: Remove unused request templates from the config. [puppet] - 10https://gerrit.wikimedia.org/r/306842 (owner: 10Ppchelko) [17:12:04] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4854150 keys - replication_delay is 0 [17:13:20] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2586690 (10BBlack) [17:14:21] 06Operations, 06MediaWiki-Stakeholders-Group, 10Traffic, 07Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2586708 (10BBlack) [17:14:24] 06Operations, 10MediaWiki-extensions-UniversalLanguageSelector, 10Traffic, 13Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2562516 (10BBlack) 05Open>03Resolved a:03BBlack Ditto the other ticket: This seems resolved with the new release y... [17:20:01] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586731 (10BBlack) Relevant: https://wikiapiary.com/wiki/Extension:CentralNotice [17:30:24] (03CR) 10Chad: "Similar work was already done on git::clone in I6c723973 and merged, will need rebasing on top of that." 
[puppet] - 10https://gerrit.wikimedia.org/r/306430 (owner: 10Giuseppe Lavagetto) [17:31:36] 06Operations, 10Ops-Access-Requests: Requesting access to the statistics host(s) for flemmerich - https://phabricator.wikimedia.org/T143881#2586802 (10leila) @Andrew the details are in the description of T143718. (I'm not sure how this task got created, automatically or not, as a result I'm a bit confused. I c... [17:32:13] 06Operations, 10Domains, 10Traffic, 10Wikimedia-Site-requests, 13Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586805 (10Dereckson) @Platonides fine for you? [17:32:44] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2576804 (10leila) [17:33:42] (03CR) 10Volans: "-1 for the global module" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305804 (owner: 10Yuvipanda) [17:33:49] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2576804 (10leila) @AlexMonk-WMF I think this task is ready for Ops review? [17:37:49] greg-g: I'd like to do a quick striker deploy. I know it's Friday, but ... it's keeping yuvipanda from fully enjoying the app. T143956 [17:37:49] T143956: Yuvipanda can't connect his phab account - https://phabricator.wikimedia.org/T143956 [17:38:31] well shit, that's an UBN! 
if I ever saw one [17:38:39] :) [17:38:45] :) [17:40:13] (03PS9) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [17:40:27] bd808: to be explicit: go forth [17:40:31] (03PS1) 10Yuvipanda: clush: Fixup missing dependency + secret [puppet] - 10https://gerrit.wikimedia.org/r/306956 [17:41:24] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [17:41:29] minimal risk (brand new service, not a SPOF for anyone (yet)) and minimal and obvious change [17:41:45] brb [17:41:52] !log change-prop deploying 08b8add4 [17:42:47] !log Updated striker to fix T143956 [17:42:48] T143956: Yuvipanda can't connect his phab account - https://phabricator.wikimedia.org/T143956 [17:43:04] missing bot is missing [17:45:01] (03CR) 10Yuvipanda: "Currently, we have no orchestration for tools at all - we either run clush from our local laptops or xargs into ssh. 
Both very unideal sol" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/305804 (owner: 10Yuvipanda) [17:45:11] (03PS10) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [17:45:31] thanks a lot for the review volans :) [17:45:49] volans I've responded, and also put up https://gerrit.wikimedia.org/r/#/c/306956/ [17:45:56] yuvipanda: just saw it by chance :) [17:46:01] to fix the issues you pointed out, and clearly mark this for only using it on labs [17:46:18] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [17:46:45] yeah, saw that too, I'm still in doubt about the host strict checking but I guess for labs it's ok-ish [17:47:05] volans yeah, for tools specifically, which is where it's being used right now. [17:47:34] I didn't look at the host generation scripts at all, I also don't have much context about them [17:47:50] I guess there is no salt on tools labs right? [17:48:09] volans yeah, problem there is that we don't have puppet resource collection in tools / labs in general, and ssh key checking is hard to implement without that [17:48:36] volans there's salt for all of labs, but not specific for tools. we have tools volunteer admins who don't have access to the saltmaster (since that runs on labcontrol1001.wikimedia.org). 
[17:48:45] Also, salt has fucked me over many many Times, during many outages [17:49:09] I've had times when it would literally not find even one single host even if I targetted it with fqdn [17:49:23] nothing more painful than opening 15 tabs in your terminal because of that [17:49:26] so I refuse to touch salt :) [17:49:36] mmmmh sounds strange, never happened to me in prod [17:50:04] I'm not speaking well of salt anyway ;) [17:50:10] volans yes, prod's salt is usually in much better shape [17:50:40] volans true, but I don't want to continue to xargs into ssh until that conversation ends :) I suspect there will be an actual evaluation, etc [17:50:56] so you need something that will be accessed by tool labs admin that are not also labs admin, make sense [17:51:09] have you considered if the clush user should be sudoers or not? [17:52:32] (03PS1) 10Madhuvishy: Convert puppet clone of cdnjs to cron [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) [17:53:34] (03PS2) 10Madhuvishy: toollabs: Convert puppet clone of cdnjs to cron [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) [17:53:36] volans we are discussing that just now [17:53:39] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Convert puppet clone of cdnjs to cron [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) (owner: 10Madhuvishy) [17:53:53] volans we do want it to be sudo'er, but also have nice audit trails and what not [17:54:22] yeah that is one of the thing I'm investigating too, we want the audit [17:54:41] but yeah for tool labs right now sounds good [17:54:52] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Convert puppet clone of cdnjs to cron [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) (owner: 10Madhuvishy) [17:55:09] volans :) wanna +1 https://gerrit.wikimedia.org/r/#/c/306956/? 
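[editor's note] The two approaches being compared above — clush versus "xargs into ssh" — look roughly like this. The clush invocations are comments only (they assume clush is installed and a node group is configured, which is exactly what the patch under review sets up); the fan-out pattern itself is demonstrated with `echo` standing in for `ssh`:

```shell
# clush: run a command across an explicit node range or a configured group:
#   clush -w tools-exec-[01-10] 'uptime'
#   clush -g tools 'uptime'
# The ad-hoc alternative mentioned above ("xargs into ssh"):
#   xargs -I{} -P10 ssh {} uptime < hosts.txt
# Same fan-out shape, runnable anywhere (echo stands in for ssh; sort
# makes the parallel output order deterministic):
printf '%s\n' host1 host2 host3 | xargs -I{} -P3 echo "would ssh {}" | sort
```

clush additionally aggregates identical output across nodes, which is the main ergonomic win over the xargs pattern.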
[17:55:19] (03PS3) 10Madhuvishy: toollabs: Convert puppet clone of cdnjs to cron [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) [17:58:30] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/306956 (owner: 10Yuvipanda) [17:59:00] (03PS2) 10Yuvipanda: clush: Fixup missing dependency + secret [puppet] - 10https://gerrit.wikimedia.org/r/306956 [17:59:05] (03CR) 10Yuvipanda: [C: 032 V: 032] clush: Fixup missing dependency + secret [puppet] - 10https://gerrit.wikimedia.org/r/306956 (owner: 10Yuvipanda) [18:04:25] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:05:38] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2586982 (10Anomie) If we have anything that does multiple correlated `foo[]`-style arrays, that would be ordering dependent. For example... [18:10:53] (03PS1) 10ArielGlenn: salt-misc: rename modules removing hyphens so pylint likes it better [software] - 10https://gerrit.wikimedia.org/r/306961 [18:10:55] (03PS1) 10ArielGlenn: salt-misc: doc strings for pylint, split up line parsing method [software] - 10https://gerrit.wikimedia.org/r/306962 [18:13:23] (03CR) 10ArielGlenn: [C: 032] salt-misc: rename modules removing hyphens so pylint likes it better [software] - 10https://gerrit.wikimedia.org/r/306961 (owner: 10ArielGlenn) [18:13:34] 06Operations, 10Deployment-Systems, 10scap, 03Scap3: Make keyholder work with systemd - https://phabricator.wikimedia.org/T144043#2587001 (10bd808) [18:17:59] (03CR) 10ArielGlenn: [C: 032] salt-misc: doc strings for pylint, split up line parsing method [software] - 10https://gerrit.wikimedia.org/r/306962 (owner: 10ArielGlenn) [18:22:54] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - 
https://phabricator.wikimedia.org/T138093#2587043 (10BBlack) >>! In T138093#2586982, @Anomie wrote: > If we have anything that does multiple correlated `foo[]`-style arrays, that... [18:23:18] (03CR) 10Esanders: "Is the objection to the word 'deprecated' or to having a comment at all?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [18:23:55] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2587045 (10AlexMonk-WMF) >>! In T143718#2586803, @leila wrote: > @AlexMonk-WMF I think this task is ready for Ops review? Possibly? I don... [18:27:31] (03PS1) 10ArielGlenn: enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 [18:27:35] well here goes nothin [18:28:24] (03CR) 10jenkins-bot: [V: 04-1] enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 (owner: 10ArielGlenn) [18:28:56] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2587063 (10Anomie) >>! In T138093#2587043, @BBlack wrote: > So long as our sorter preserves the relative order of duplicated parameter n... [18:29:36] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:30:36] figures. now I get to see how to run it on all the contents without needing an __init__.py in there [18:30:39] meeeehhhh [18:32:07] (03CR) 10Yurik: "Objection is only to the "deprecated" word. 
I am totally ok with a comment that prevents people from accidentally enabling it anywhere exc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [18:32:35] 06Operations, 10Ops-Access-Requests, 06Research-and-Data, 10Research-management: Request access to data for reader research - https://phabricator.wikimedia.org/T143718#2587073 (10leila) @AlexMonk-WMF got it. :) @ottomata I'm not sure how I can make sure if this task is received by Ops or not? Can you or s... [18:34:49] 06Operations, 10MediaWiki-General-or-Unknown, 06Services, 10Traffic: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2587076 (10BBlack) Apparently v3's libvmod-boltsort and v4's std.querysort() both fail to do so. Nothing about the handling of duplicat... [18:52:18] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2563658 (10yuvipanda) We've been using clush with tool labs for a while, and I really like it. I think requirements for individual labs projects in general will always be... [18:54:58] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2587114 (10yuvipanda) Fabric is also python2 only, and their ssh backend (paramiko) doesn't support the stricter security settings we run with anyway. [19:02:38] (03PS2) 10ArielGlenn: enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 [19:03:10] I admit defeat. added the stupid init file. but now.... it's going to make me rename the module. = the directory. = I'm annoyed. [19:03:21] (03CR) 10jenkins-bot: [V: 04-1] enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 (owner: 10ArielGlenn) [19:06:36] stupid thing! 
I would like it to find all the invalid names except the top level directory but there's no nice way to do that it seems [19:06:40] BAH HUMBUG [19:07:35] right. I have dinner here next to me getting cold. and cold tortellini is just not nice. so I'm going to go away for a bit and give pylint a chance to get its act together :-P [19:09:11] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2587166 (10mmodell) @yuvipanda: paramiko has that fixed in the latest version. [19:09:53] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2587174 (10yuvipanda) ah, nice. However, the fact that it took that long to fix it gives me cause for concern when the next deprecation happened. IIRC that patch to fix it w... [19:10:31] (03CR) 10Esanders: "that sounds like a -1 then" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [19:12:42] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2587180 (10mmodell) @yuvipanda yes, it took them a while and I am not at all suggesting that fabric is a good choice. I would advocate for clustershell or even mcollective. [19:17:42] (03PS3) 10ArielGlenn: enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 [19:17:52] yeah yeah dinner schminner. 
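[editor's note] The `__init__.py` fight above is a known pylint quirk: pointed at a directory, pylint wants an importable package, and the package (directory) name is then subject to the invalid-name check. Two common workarounds, sketched here with illustrative paths (only the `find` half is executed):

```shell
# Feed pylint individual files instead of the directory, sidestepping the
# package requirement:
#   find salt_misc -name '*.py' -print0 | xargs -0 pylint
# Or keep the directory target and suppress just the name check (C0103):
#   pylint --disable=invalid-name salt_misc/
# The file-discovery half is plain coreutils:
mkdir -p demo_pkg && touch demo_pkg/a.py demo_pkg/b.py
find demo_pkg -name '*.py' | sort
```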
in a few minutes [19:21:55] (03PS4) 10ArielGlenn: enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 [19:22:05] speling fix [19:23:05] (03CR) 10ArielGlenn: [C: 032] enable pylint for tox on salt-misc directory [software] - 10https://gerrit.wikimedia.org/r/306971 (owner: 10ArielGlenn) [19:32:27] (03PS3) 10BBlack: text VCL: limited redirect for awful TLS negotiations [puppet] - 10https://gerrit.wikimedia.org/r/306935 [19:33:11] (03PS1) 10ArielGlenn: apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 [19:33:34] (03CR) 10BBlack: "In PS3, I dropped DHE ciphers from being selected for this. They're also not great, but still slightly better than non-FS and/or 3DES cho" [puppet] - 10https://gerrit.wikimedia.org/r/306935 (owner: 10BBlack) [19:34:17] (03CR) 10jenkins-bot: [V: 04-1] apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 (owner: 10ArielGlenn) [19:43:08] (03CR) 10BBlack: [C: 032] text VCL: limited redirect for awful TLS negotiations [puppet] - 10https://gerrit.wikimedia.org/r/306935 (owner: 10BBlack) [19:46:38] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2587265 (10kaldari) I'll create one... [20:02:59] (03PS1) 10Ppchelko: Redirect /api/rest_v1 to RESTBase docs page. [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) [20:03:57] (03CR) 10Ppchelko: "I have very limited understanding of what I'm doing here, so I guess there should be a better way to achieve that. 
Comments/help would be " [puppet] - 10https://gerrit.wikimedia.org/r/306979 (https://phabricator.wikimedia.org/T125226) (owner: 10Ppchelko) [20:09:51] (03PS2) 10ArielGlenn: apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 [20:10:46] (03CR) 10jenkins-bot: [V: 04-1] apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 (owner: 10ArielGlenn) [20:11:17] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4858891 keys - replication_delay is 633 [20:12:08] (03PS3) 10ArielGlenn: apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 [20:12:29] (03PS1) 10Andrew Bogott: Added filtertags to labs role descriptions. [puppet] - 10https://gerrit.wikimedia.org/r/306981 (https://phabricator.wikimedia.org/T91990) [20:13:09] (03CR) 10jenkins-bot: [V: 04-1] apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 (owner: 10ArielGlenn) [20:15:48] (03CR) 10Krinkle: toollabs: Convert puppet clone of cdnjs to cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) (owner: 10Madhuvishy) [20:16:35] (03PS4) 10ArielGlenn: apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 [20:17:36] (03CR) 10jenkins-bot: [V: 04-1] apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 (owner: 10ArielGlenn) [20:18:44] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2587385 (10Volker_E) ``` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDKpQEVtQw8XJCKvG7oNL9d+dveFqRpHCzduUvdzi6MK7zgtY9xUKF7ZcQObgP5fCpgLonDUu2b61WRzYgiPoBkhhnTKqRkCx6sT6KfsHaSE2iZefW5N7qEyRqpW... 
[20:19:49] (03PS5) 10ArielGlenn: apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 [20:19:58] come on flake8, don't be flaky on me [20:24:07] (03CR) 10ArielGlenn: [C: 032] apply flake8 to checkhosts scripts [software] - 10https://gerrit.wikimedia.org/r/306977 (owner: 10ArielGlenn) [20:31:36] PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 419 MB (5% inode=53%) [20:31:55] <_joe_> shit [20:32:29] _joe_: I think db1047 was like reimaged or is out of prod, let me check [20:32:44] <_joe_> it's just / [20:33:03] <_joe_> not /srv [20:33:17] 2.1 GB in /var/log maybe? [20:33:36] <_joe_> yes [20:33:58] 1.3Geventlogging_sync.log [20:34:22] (03PS2) 10Andrew Bogott: Added filtertags to labs role descriptions. [puppet] - 10https://gerrit.wikimedia.org/r/306981 (https://phabricator.wikimedia.org/T91990) [20:35:19] <_joe_> volans: are you looking into it? [20:35:39] I am [20:35:45] _joe_: yes, I think I might have rotated manually that synlog back in april :D [20:36:07] <_joe_> well, do it again :P [20:36:37] something used 300MB in / at 11:22 this morning [20:36:45] in ~1 minute [20:36:50] jynus: ^^^ [20:37:17] 11:22? [20:38:08] I do not know, analytics, which this host is, was going to do things on this and dbstore1002 [20:38:48] wait, in /, not /srv ? [20:38:58] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4852841 keys - replication_delay is 0 [20:39:55] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2587442 (10kaldari) Test case: https://commons.wikimedia.org/wiki/File:Mscorefonts_svg_rendering_test.svg Current screenshot of testcase... 
[20:40:24] <_joe_> jynus: yes [20:40:34] I am looking now [20:40:44] you can go away [20:41:12] <_joe_> jynus: volans already found a 1.3 GB log file [20:41:41] yes although that file grows slowly [20:41:43] oh, / only has 8 GB [20:41:56] I don't know yet what used 300MB in a minute this morning [20:42:00] and didn't free it after [20:45:49] there is also a deleted one but only 66MB : 66763968 390915 /var/log/account/pacct (deleted) [20:45:59] (03CR) 10Andrew Bogott: [C: 032] Added filtertags to labs role descriptions. [puppet] - 10https://gerrit.wikimedia.org/r/306981 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [20:46:35] jynus: ping me if you need a hand [20:47:23] what is worse is / is not on LVS [20:48:06] I think I will simlink to /srv/logs [20:48:21] db1047 has to die at some point [20:49:14] in /var/log/account there are 300MB if need a bit fo space [20:49:19] s/fo/of/ [20:50:41] 06Operations, 06Commons, 10Wikimedia-SVG-rendering, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2587467 (10kaldari) Note that Webdings may not actually render any dingbat symbols in that test depending on if the font is the Unicode v... [20:51:45] 07Puppet, 10Continuous-Integration-Infrastructure: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2587470 (10hashar) Git bisect on a Jessie image yields 8cfbc62e5d1c8657fd394728d1cf4d75952c91f3 ``` commit 8cfbc62e5d1c8... [20:54:17] RECOVERY - MariaDB disk space on db1047 is OK: DISK OK [20:54:42] 07Puppet, 10Continuous-Integration-Infrastructure: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2587505 (10hashar) Removing the include standard from the apache module fix it: ``` diff --git a/modules/apache/manifests... 
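[editor's note] The triage sequence used in this incident — find what grew, then check for deleted-but-still-open files like the `pacct` one spotted above — is worth spelling out. The first three commands are shown as comments because they need root and a real full filesystem; the `du`-based size check is demonstrated on a scratch directory:

```shell
# What grew (stay on one filesystem with -x, sort by size):
#   du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
# Deleted-but-open files still holding space (link count < 1):
#   lsof +L1
# Biggest files in the suspect directory:
#   ls -laSh /var/log | head
# The du/sort pattern itself, runnable anywhere:
d="$(mktemp -d)"
dd if=/dev/zero of="$d/big" bs=1024 count=300 2>/dev/null
du -sk "$d" | awk '{print ($1 >= 300) ? "found the ~300KB" : "unexpected"}'
```

`lsof +L1` is the one that catches the "deleted one but only 66MB" case: space freed by `rm` is not returned until the last process holding the file descriptor closes it.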
[20:56:53] 07Puppet, 10Continuous-Integration-Infrastructure: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2587506 (10hashar) [21:06:24] 90% of the space was occupied by kernels [21:11:15] 07Puppet, 10Continuous-Integration-Infrastructure: Cant refresh Nodepool snapshot due to puppet: Could not find class passwords::puppet::database - https://phabricator.wikimedia.org/T143769#2587537 (10hashar) @akosiaris I could use some assistance / idea on this one. For Nodepool I am provisioning lightweight... [21:20:28] volans: still around? [21:20:49] hashar: more or less :) [21:20:56] volans: thanks for the mail about operations/software.git :) I might send a few patches on a volunteer basis [21:21:10] somehow fixing flake8 errors is kind of entertaining [21:21:27] yeah I was replying to ariel now [21:22:15] oh he wrote to me only [21:22:42] kabal! [21:23:15] I was just saying that an autopep8 can solve automatically many of them but first you have to delete the .pep8 locally [21:23:23] otherwise it will skip the ignored ones ;) [21:23:55] we should get rid of that /.pep8 file [21:24:37] let's just do it then :) I think was useless too [21:27:22] more or less [21:27:33] (03PS2) 10Dereckson: Allow bureaucrats to manage account creators group on ar.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) [21:27:39] the only useful bit is filename = *.py,geturls,swiftcleaner*,profiler-to-carbon [21:27:44] to catch files that do not end with .py [21:27:54] then they are all excluded so ... 
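[editor's note] "90% of the space was occupied by kernels" is the classic small-root-partition failure mode on Debian/Ubuntu. The usual two steps are listed as comments (they need root); the `awk` filtering of dpkg's status column is demonstrated on sample output:

```shell
# See which kernel packages are installed ("ii" = installed):
#   dpkg -l 'linux-image-*' | awk '/^ii/{print $2}'
# Purge the ones no longer needed (keeps the running + newest kernel):
#   apt-get autoremove --purge
# The awk filter on its own, against sample dpkg -l output:
printf 'ii  linux-image-3.19 amd64\nii  linux-image-4.4 amd64\nrc  linux-image-3.2 amd64\n' \
  | awk '/^ii/{print $2}'
```

The `rc` line (removed, config remaining) is correctly excluded — only fully installed packages are candidates for purging.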
[21:28:08] (03CR) 10Luke081515: [C: 031] "LGTM (but needs rebase)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306595 (https://phabricator.wikimedia.org/T143844) (owner: 10Dereckson) [21:29:06] Dereckson: lol, take a look at grrrit-wm comment as 23:27: I updated simply the comit description, but he points that out like you did that [21:29:45] hashar: those are already in the tox file [21:29:57] and part of the procedure to opt-in... [21:30:31] it would be nice to have tox check the shebang of files without extensions but for now you have to "declare" them [21:30:44] Luke081515: I don't maintain grrrit-wm [21:31:06] probably the new gerrit version format [21:31:12] Dereckson: no, but it's kind of funny anyway :) [21:31:36] (03PS1) 10Hashar: Remove /.pep8 file: no more needed [software] - 10https://gerrit.wikimedia.org/r/307018 [21:31:59] Luke081515: check if by default bureaucrats can't add account creators [21:32:21] volans: yeah flake8 is quite lame [21:32:35] as well as most linting tools. They just look for the file suffix [21:32:48] (03CR) 10Volans: [C: 031] "LGTM" [software] - 10https://gerrit.wikimedia.org/r/307018 (owner: 10Hashar) [21:33:00] hashar: do you want me to +2? [21:33:09] yeah [21:33:19] (03CR) 10Volans: [C: 032] Remove /.pep8 file: no more needed [software] - 10https://gerrit.wikimedia.org/r/307018 (owner: 10Hashar) [21:33:33] neat :D [21:33:40] volans: and we might want to move the legacy/no more used scripts to some /attic/ directory [21:33:43] Luke081515: there is a new comment on the task bureaucrats can add but not remove the right [21:35:07] Dereckson: Actually they can't do anything with that right, but IIRC your patch already activates both? [21:35:39] hashar: in those cases there are 2 possible approaches... 1) don't touch, it might be in use somewhere or 2) move/delete it and see if anyone complains :D [21:36:55] Dereckson: I wonder why he commented that way, I think there isn't a merged patch? 
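[editor's note] Two things from the `.pep8` discussion above, sketched with illustrative names. flake8 only considers `*.py` by default, so extensionless scripts must be declared via the `filename` option; and the wished-for "check the shebang of files without extensions" is a grep one-liner:

```shell
# Declaring extensionless scripts to flake8 (CLI form; the equivalent
# [flake8] config stanza is  filename = *.py,geturls,swiftcleaner*  ):
#   flake8 --filename='*.py,geturls,swiftcleaner*' .
# Finding extensionless Python scripts by shebang, runnable as-is:
mkdir -p demo_scripts
printf '#!/usr/bin/env python\nprint("hi")\n' > demo_scripts/geturls
printf 'just notes\n' > demo_scripts/README
grep -rlE '^#!/.*python' demo_scripts
```

The grep prints only `demo_scripts/geturls`: the README has no Python shebang, so `-l` (list matching files) skips it.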
[21:37:51] so as there is no merged patch [21:38:01] could be default is bureaucrats can add that [21:39:19] hashar: zuul a bit delayed? [21:40:04] a bit busy yeah https://integration.wikimedia.org/zuul/ [21:40:33] ok [21:41:55] volans: it will land eventually [21:42:04] yeah no hurry [21:43:16] (03PS11) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [21:44:40] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [21:46:04] (03PS12) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [21:47:56] (03PS13) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [21:49:06] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [21:51:58] (03Merged) 10jenkins-bot: Remove /.pep8 file: no more needed [software] - 10https://gerrit.wikimedia.org/r/307018 (owner: 10Hashar) [21:52:12] volans: danke :) [21:52:26] yw :) [21:52:32] thank you! [21:57:04] hey [21:57:13] maybe I am the only one with a pylint running on a repo via tox [21:57:43] Volker_E: Do you have a labs account currently? If so what is your username there? [21:57:44] eventually I will get it running on all my code but... baby steps [21:58:35] 06Operations, 10Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#2587598 (10Andrew) @Volker_E, do you have a labs account? If so, what is your username there? If not... 
please create one :) (We generally try to keep uids in sync between labs and prod) [22:00:49] volans: what is autopep8? [22:01:15] apergos: a tool that automatically try to fix pep8 styling errors [22:01:21] ohhhh [22:02:09] by default only spacing IIRC but you can "force" it to be more invasive [22:02:16] installed! [22:02:21] I suggest to run it on a clean repo [22:02:25] without pending changes [22:02:33] easier to review what it does [22:02:47] ah yes, "-a" for aggressive [22:02:48] and if you do a git pull you also get the deletion of the .pep8, just merged ;) [22:03:16] nice! [22:16:25] (03PS1) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [22:17:21] (03CR) 10jenkins-bot: [V: 04-1] tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 (owner: 10Yuvipanda) [22:17:53] (03PS2) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [22:19:27] yuvipanda: append with 2 p ;) [22:20:06] volans '2p'? 
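[editor's note] The autopep8 workflow volans describes — clean checkout first, fix, review the diff — looks like this. The autopep8 and git commands are comments (they assume autopep8 is installed and a repo is present); the flavor of its default whitespace-only fix is emulated with sed:

```shell
# Run on a clean tree so the resulting diff is reviewable:
#   git status --short           # confirm nothing is pending
#   autopep8 -i script.py        # -i edits in place (whitespace fixes)
#   autopep8 -i -a -r .          # -a "aggressive", -r recurse the repo
#   git diff                     # review before committing
# Taste of the default fix: 'x=1+2' becomes 'x = 1 + 2'. Emulated with sed:
printf 'x=1+2\n' > /tmp/demo_pep8.py
sed -E 's/=/ = /; s/[+]/ + /' /tmp/demo_pep8.py
```

As noted above, deleting the repo's `.pep8` first matters: otherwise autopep8 skips the very codes the file told it to ignore.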
[22:20:10] aaah [22:20:11] bah [22:20:25] I always fuck up append and accommodate and others [22:20:43] yuvipanda: yes but I guess that tee options accept only the correct one ;) [22:20:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [22:20:50] yeah [22:21:00] when I was testing it manually I hacked it and forgot to transfer it [22:21:41] (03PS3) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [22:22:58] (03PS4) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [22:23:20] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 4854314 keys - replication_delay is 0 [22:23:50] (03CR) 10Yuvipanda: toollabs: Convert puppet clone of cdnjs to cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/306958 (https://phabricator.wikimedia.org/T143637) (owner: 10Madhuvishy) [22:25:50] * volans going off ttyl [22:47:23] !log running html dump for en wikipedia manually out of screen session (ariel) on francium [22:47:33] where is morebots [22:49:50] logged it manually [22:50:15] apergos, ^ there it is [22:50:20] heh [22:50:32] not sure why it broke [22:50:33] can it be that no one has logged anything in the last day but me?
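The "append with 2 p" aside in the exchange above is about tee's append flag: plain `tee` truncates its output file, while `tee -a` (`--append`) adds to the end. A minimal demo:

```shell
#!/bin/sh
# tee without -a truncates its output file; tee -a (--append) appends
# instead -- the flag behind the "append with 2 p" aside.
f=$(mktemp)
echo first  | tee "$f"    > /dev/null   # creates (or truncates) the file
echo second | tee -a "$f" > /dev/null   # -a appends; "first" survives
cat "$f"                                # first, then second
```

Misspelling the long option (e.g. `--apend`) makes tee exit with an error rather than silently truncate, which is the "tee options accept only the correct one" point.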
[22:50:34] anyways [22:50:34] had to jstart it [22:50:37] thank you [22:50:46] possibly, it is a friday [22:51:07] stashbot got it: https://tools.wmflabs.org/sal/production [22:51:14] (03CR) 10BryanDavis: tools: Add a wrapper script to enforce clush access (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/307026 (owner: 10Yuvipanda) [22:51:48] wow, morebots missed a lot [22:51:48] !log from a few hours ago: (08:42:47 μμ) bd808: !log Updated striker to fix T143956 [22:51:50] T143956: Yuvipanda can't connect his phab account - https://phabricator.wikimedia.org/T143956 [22:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:07] I think it'd be best to just copy/paste from stashbot to the wiki at this point [22:52:24] a lot happened since 15:13 yesterday [22:52:24] point taken [22:52:43] anyone in a timezone where it's not 2 am and willing to do it? :-D [22:52:44] stashbot and morebots have a symbiotic relationship. I've copied in both directions before [22:53:01] ohh sorry for the gratuitous ping there with the log, bd 808 [22:53:18] no worries. I was actually already here :) [22:53:18] he pings on stashbot, too, I'm sure :) [22:53:24] heh [22:53:44] I ping on a lot of stuff [22:53:48] * bd808 is an irc stalker [22:53:48] all right, I am going to stop looking at html dumps and start looking at bed, lights off, etc... [22:54:00] how do you look at a light that is off? [22:54:06] easy [22:54:12] may I demonstrate [22:54:17] please [22:54:25] the light is now off [22:54:28] I am looking at it [22:54:41] next step will be to move slowly away from the keyboard.... [22:54:45] (I was expecting you to just disappear at that point) [22:54:51] :) [22:54:55] heh [22:54:56] ttfn [22:57:22] greg-g: roshambo for the cleanup? [22:57:32] 1..2..3..rock!
[22:57:48] (good old rock works every time) [22:59:15] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2587703 (10yuvipanda) [23:12:13] (03PS5) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [23:12:27] bd808 ^ updated [23:16:08] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2587726 (10yuvipanda) I'm going to pick up *some* of this next week, specifically stop allowing using role::puppet::self fr... [23:17:07] (03PS2) 10Yuvipanda: [WIP] base: Use the standard location for puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/257275 [23:17:25] (03PS3) 10Yuvipanda: base: Use the standard location for puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/257275 [23:17:53] (03PS4) 10Ori.livneh: [DNM] hack maintain-replicas.pl for adywiki/jamwiki [software] - 10https://gerrit.wikimedia.org/r/295564 (https://phabricator.wikimedia.org/T135029) [23:22:07] (03CR) 10BryanDavis: [C: 031] "I could quibble about using #!/bin/bash and not using [[ ]] for tests, but that's just a pet peeve." [puppet] - 10https://gerrit.wikimedia.org/r/307026 (owner: 10Yuvipanda) [23:24:43] (03PS6) 10Yuvipanda: tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 [23:24:54] bd808 updated about [[ :) [23:26:34] :) thx yuvipanda [23:35:18] (03PS1) 10Thcipriani: Bump scap version to 3.2.4-1 [puppet] - 10https://gerrit.wikimedia.org/r/307028 [23:43:15] hi [23:43:22] "Hard purge failed: [V8DT9ApAMDsAADeHxTwAAACF] Exception Caught: LinksUpdate::acquirePageLock: Cannot COMMIT to clear snapshot because writes are pending." 
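bd808's review quibble above about `#!/bin/bash` and `[[ ]]` comes down to word-splitting: bash's `[[ ]]` does not split unquoted variables, while the POSIX `[ ]` builtin does, so `[ ]` needs careful quoting. A small sketch, running each construct through an explicit `bash -c`/`sh -c` so it works regardless of the invoking shell:

```shell
#!/bin/sh
# bash's [[ ]] skips word-splitting on unquoted variables; POSIX [ ]
# does not, so an unquoted variable containing a space breaks it.
dbl=$(bash -c 'v="two words"; [[ $v == "two words" ]] && echo ok')
quoted=$(sh -c 'v="two words"; [ "$v" = "two words" ] && echo ok')
unquoted=$(sh -c 'v="two words"; [ $v = "two words" ] && echo ok' 2>/dev/null || echo error)
echo "$dbl $quoted $unquoted"
```

The third case fails with "too many arguments" because `$v` splits into two words inside `[ ]`; inside `[[ ]]` the same unquoted expansion is safe.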
At Wikisource [23:43:40] Also I've been noticing it's been taking up to 5 mins for changes in status to update in the DB [23:44:01] Can someone check to see if this is a local issue or a system-wide problem? [23:44:49] The LinksUpdate exception can be mostly ignored :) [23:45:38] Okay but the lag time on updates is of more concern... [23:45:44] Thought you should know [23:45:46] * ShakespeareFan00 out [23:45:51] Joq queue looks ok. [23:45:54] *Job [23:54:48] (03CR) 10Yurik: [C: 04-1] Follow-up I049fa67: Remind people not to enable wgKartographerWikivoyageMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284483 (owner: 10Jforrester) [23:57:51] (03PS1) 10Yuvipanda: clush: Put clush config in correct location [puppet] - 10https://gerrit.wikimedia.org/r/307034 [23:58:34] (03CR) 10Yuvipanda: [C: 032 V: 032] clush: Put clush config in correct location [puppet] - 10https://gerrit.wikimedia.org/r/307034 (owner: 10Yuvipanda) [23:59:19] (03CR) 10Yuvipanda: [C: 032] tools: Add a wrapper script to enforce clush access [puppet] - 10https://gerrit.wikimedia.org/r/307026 (owner: 10Yuvipanda)