[00:00:04] addshore, hashar, anomie, no_justification, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180105T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:04:44] Operations, Release Pipeline, Release-Engineering-Team (Kanban): Package/upload service-checker for Debian stretch - https://phabricator.wikimedia.org/T184224#3876861 (dduvall) p:Triage>Normal
[00:10:55] Operations, Ops-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876879 (Jdforrester-WMF)
[00:16:11] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:17:01] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 79437 bytes in 0.293 second response time
[00:20:59] Operations, Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3876913 (RonaldB) Checked a few mails I received directly from Trijnstel. Microsoft recently used 40.92.65.66 to contact the post office of Google. A...
[01:01:22] (PS20) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:24:18] (CR) Dzahn: [C: 1] "I tested this as follows:" [puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[01:32:58] (PS1) Dzahn: planet: move locales include out of module [puppet] - https://gerrit.wikimedia.org/r/402161
[01:33:20] (CR) jerkins-bot: [V: -1] planet: move locales include out of module [puppet] - https://gerrit.wikimedia.org/r/402161 (owner: Dzahn)
[01:34:15] (PS2) Dzahn: planet: move locales include out of module [puppet] - https://gerrit.wikimedia.org/r/402161
[01:34:54] (PS21) Aaron Schulz: [WIP] Add mcrouter module and mcrouter_wancache profile [puppet] - https://gerrit.wikimedia.org/r/392221
[01:35:36] Operations, DNS, Mail, Traffic: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3877038 (Krenair)
[01:37:54] Puppet, Beta-Cluster-Infrastructure, Tracking: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (Krenair) p:Triage>Normal
[01:37:59] (CR) Dzahn: "delta is still 0 because locales::extended is neither role nor profile...
always these chicken-egg problems :)" [puppet] - https://gerrit.wikimedia.org/r/402161 (owner: Dzahn)
[01:38:33] Puppet, Beta-Cluster-Infrastructure: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (Krenair)
[01:39:40] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-cache-text04 due to varnishkafka issues - https://phabricator.wikimedia.org/T184234#3877051 (Krenair) p:Triage>Normal
[01:42:30] Puppet, Analytics, Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877064 (Krenair) p:Triage>Normal
[01:43:04] Puppet, Analytics, Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877064 (Krenair) ```krenair@deployment-kafka03:~$ sudo puppet agent -tv Warning: Setting configtimeout is deprecated. (at /usr/lib/ruby/vendor_ruby/pup...
[01:46:45] Puppet, Analytics, Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877076 (Krenair) 2.7G /var/log/daemon.log 2.6G /var/log/daemon.log.1 221M /var/log/kafka/controller.log 257M /var/log/kafka/kafka-mirror-main-deployment-pr...
[01:51:13] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3877079 (Krenair) p:Triage>Normal
[01:53:10] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-eventlogging04 due to missing repo on deployment-tin? - https://phabricator.wikimedia.org/T184238#3877100 (Krenair) p:Triage>Normal
[01:54:52] Puppet, Analytics, Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka03 due to full disk - https://phabricator.wikimedia.org/T184235#3877117 (Krenair) Repeat of T174742 ?
[01:55:32] (PS1) Dzahn: locales: convert to profile [puppet] - https://gerrit.wikimedia.org/r/402164
[01:56:02] (CR) jerkins-bot: [V: -1] locales: convert to profile [puppet] - https://gerrit.wikimedia.org/r/402164 (owner: Dzahn)
[01:59:58] (PS2) Dzahn: locales: convert to profile [puppet] - https://gerrit.wikimedia.org/r/402164
[02:00:52] (CR) Dzahn: "can be fixed a better way after https://gerrit.wikimedia.org/r/#/c/402164/" [puppet] - https://gerrit.wikimedia.org/r/402161 (owner: Dzahn)
[02:06:11] Operations, Developer-Relations, cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3877125 (Samwilson) @Qgil you probably know what you're up to, but give me a yell if I can help at all.
[02:08:07] (PS2) Dzahn: network::constants: add fake CACHE_MISC for labs [puppet] - https://gerrit.wikimedia.org/r/402136
[02:10:48] (PS3) Dzahn: network::constants: add fake CACHE_MISC for labs [puppet] - https://gerrit.wikimedia.org/r/402136
[02:12:22] (PS2) Dzahn: Releases: Install composer alongside Jenkins [puppet] - https://gerrit.wikimedia.org/r/401804 (owner: Chad)
[02:12:56] (CR) Dzahn: [C: 2] Releases: Install composer alongside Jenkins [puppet] - https://gerrit.wikimedia.org/r/401804 (owner: Chad)
[02:12:58] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3877128 (Krenair) p:Triage>Normal
[02:14:32] (CR) Dzahn: "Git::Clone[jenkins CI Composer]/Exec[git_clone_jenkins CI Composer]/returns: executed successfully" [puppet] - https://gerrit.wikimedia.org/r/401804 (owner: Chad)
[02:15:06] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jump-[12] due to version of a package being missing -
https://phabricator.wikimedia.org/T184240#3877141 (Krenair) p:Triage>Normal
[02:17:44] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-trending01 due to removal of role - https://phabricator.wikimedia.org/T184241#3877153 (Krenair) p:Triage>Normal
[02:21:46] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-netbox, looks like it thinks its a prod box - https://phabricator.wikimedia.org/T184242#3877167 (Krenair) p:Triage>Normal
[02:27:52] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1515119269 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3120228 keys, up 3 minutes 1 seconds - replication_delay is 1515119269
[02:29:57] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-redis0[12] due to systemd on trusty - https://phabricator.wikimedia.org/T184243#3877182 (Krenair) p:Triage>Normal
[02:30:01] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 3113165 keys, up 5 minutes 4 seconds - replication_delay is 0
[02:31:00] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#3877193 (Krenair) p:Triage>Normal
[02:31:52] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6481
[02:33:01] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 3113756 keys, up 5 minutes 1 seconds - replication_delay is 0
[02:40:53] Operations, Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3877211 (Dzahn) @herron I don't know either, I don't remember an issue like this.
But..i saw there is this file called "legacy_mailing_lists" and the...
[02:42:51] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3034 mails in exim queue.
[02:48:07] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-cache-text04 due to varnishkafka issues - https://phabricator.wikimedia.org/T184234#3877214 (Krenair) hiera part: ```diff --git a/hieradata/labs/deployment-prep/host/deployment-cache-text04.yaml b/hieradata/labs/deployment-prep/host/deploym...
[02:52:21] Operations, Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3877215 (Dzahn) Have you been able to contact the list administrator? Or is that one of you? Moderators-nl list run by taketawiki at hotmail.com, l.h...
[02:56:39] Operations, Cloud-VPS, DNS, Traffic, Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3877228 (Krenair) a:Krenair>None (alternatively we could just not use designate and inste...
[03:02:09] (PS7) Dzahn: httpd: testing new module with planet (test only) [puppet] - https://gerrit.wikimedia.org/r/402118
[03:02:55] (PS3) Dzahn: apache: add httpd module as a replacement [puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[03:03:28] (CR) jerkins-bot: [V: -1] httpd: testing new module with planet (test only) [puppet] - https://gerrit.wikimedia.org/r/402118 (owner: Dzahn)
[03:05:07] (CR) Dzahn: [C: 2] "tested per above and being bold and just going ahead. nothing is using this as of now."
[puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[03:13:18] (PS1) Dzahn: planet: switch from module apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402165
[03:14:37] (CR) Chad: [C: -2] Gerrit 2.14.6 [software/gerrit] - https://gerrit.wikimedia.org/r/395820 (https://phabricator.wikimedia.org/T156120) (owner: Chad)
[03:20:40] (PS2) Dzahn: planet: switch from module apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402165
[03:21:19] (CR) Dzahn: "one little change between the 2 modules is that the priority parameter now really just accepts integers and before it accepted strings too" [puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[03:23:49] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/9582/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/402165 (owner: Dzahn)
[03:24:08] (CR) Dzahn: "here's a diff of resources when switching a service to new module: http://puppet-compiler.wmflabs.org/9582/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[03:25:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 853.34 seconds
[03:26:18] (CR) Dzahn: "first use of this now here: https://gerrit.wikimedia.org/r/#/c/402165/" [puppet] - https://gerrit.wikimedia.org/r/400100 (owner: Giuseppe Lavagetto)
[03:28:40] (CR) Dzahn: "all that happened is Notice: /Stage[main]/Httpd/File[/etc/apache2/conf-enabled/50-server-status.conf]/ensure: removed" [puppet] - https://gerrit.wikimedia.org/r/402165 (owner: Dzahn)
[03:30:50] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/400100/ & https://gerrit.wikimedia.org/r/#/c/402165/ now actually do this in prod" [puppet] - https://gerrit.wikimedia.org/r/402118 (owner: Dzahn)
[03:30:56] (Abandoned) Dzahn: httpd: testing new module with planet (test only) [puppet] -
https://gerrit.wikimedia.org/r/402118 (owner: Dzahn)
[03:42:33] (PS1) Dzahn: webserver_misc_static: switch apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402168
[03:47:24] (PS2) Dzahn: webserver_misc_static: switch apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402168
[03:52:15] (PS1) Chad: Undeploy EducationProgram from test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402170
[03:52:17] (CR) Chad: [C: 2] Undeploy EducationProgram from test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402170 (owner: Chad)
[03:53:45] (Merged) jenkins-bot: Undeploy EducationProgram from test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402170 (owner: Chad)
[03:54:51] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Undeploy EducationProgram from test2wiki (duration: 00m 48s)
[03:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:37] (CR) jenkins-bot: Undeploy EducationProgram from test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/402170 (owner: Chad)
[04:00:05] (PS3) Dzahn: webserver_misc_static: switch apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402168
[04:04:51] (PS4) Dzahn: webserver_misc_static: switch apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402168
[04:07:01] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 230.97 seconds
[04:08:40] (PS38) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[04:10:59] (PS5) Dzahn: webserver_misc_static: switch apache to module httpd [puppet] - https://gerrit.wikimedia.org/r/402168
[04:11:55] (CR) Dzahn: [C: 2] "wmf-style: total violations delta -10 :)" [puppet] - https://gerrit.wikimedia.org/r/402168 (owner: Dzahn)
[04:21:42] (PS1) Dzahn: peopleweb/publichtml: use
module httpd [puppet] - https://gerrit.wikimedia.org/r/402172
[04:26:12] (PS2) Dzahn: peopleweb/publichtml: use module httpd [puppet] - https://gerrit.wikimedia.org/r/402172
[04:36:58] (CR) Dzahn: [C: 2] "wmf-style: total violations delta -5" [puppet] - https://gerrit.wikimedia.org/r/402172 (owner: Dzahn)
[04:37:48] (PS3) Dzahn: peopleweb/publichtml: use module httpd [puppet] - https://gerrit.wikimedia.org/r/402172
[04:44:54] (PS1) Dzahn: peopleweb: re-add ferm rule for http [puppet] - https://gerrit.wikimedia.org/r/402173
[04:45:21] (PS2) Dzahn: peopleweb: re-add ferm rule for http [puppet] - https://gerrit.wikimedia.org/r/402173
[04:46:39] (CR) Dzahn: [C: 2] peopleweb: re-add ferm rule for http [puppet] - https://gerrit.wikimedia.org/r/402173 (owner: Dzahn)
[05:13:01] (PS4) Jayprakash12345: Turn on mapframe for Arabic Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764)
[05:37:51] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.249 second response time
[05:46:22] Operations, Cloud-VPS, monitoring: remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3877373 (Andrew) >>! In T183917#3868330, @Dzahn wrote: > Thanks for the link Paladox, i wasn't aware of that search on tools.wmflabs. I guess i will just deleted the wiki pages? > > @A...
[06:18:13] Operations, ops-codfw, DBA: db2054: Disk with predictive failure - https://phabricator.wikimedia.org/T183887#3877406 (Marostegui) Open>Resolved All good now - thank you! ``` root@db2054:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FE1C0)...
[06:19:22] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402178 (https://phabricator.wikimedia.org/T174569)
[06:21:17] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402178 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:22:43] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402178 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:22:53] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402178 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[06:23:56] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 - T163190 (duration: 00m 51s)
[06:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:07] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[06:24:42] !log Deploy schema change on db1094 - T174569
[06:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:56] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:36:18] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.131 second response time
[06:41:02] (PS1) Marostegui: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402182 (https://phabricator.wikimedia.org/T163190)
[06:42:25] Operations, Continuous-Integration-Infrastructure, Traffic: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3877424 (Legoktm)
[06:45:51] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402182 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[06:47:28] (Merged) jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402182 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[06:47:38] (CR) jenkins-bot: db-eqiad.php: Depool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402182 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[06:48:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1098:3317 - T163190 (duration: 00m 27s)
[06:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:28] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[06:49:15] !log Stop replication in sync on db1039 and db1098:3317 - T163190
[06:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:19] <_joe_> !log rebooting mw1261
[07:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:49] PROBLEM - Host mw1261 is DOWN: PING CRITICAL - Packet loss = 100%
[07:29:18] RECOVERY - Host mw1261 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[07:31:58] Operations, DNS, Mail, Traffic: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3876973 (Peachey88) I have a funny feeling that fundraising may be using @wikipedia.com aliases in emails.
[07:37:57] <_joe_> !log rebooting mw1276 toio, kernel upgrade
[07:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:19] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: Return code of 255 is out of bounds
[07:40:48] PROBLEM - Host mw1276 is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:58] RECOVERY - Host mw1276 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[07:42:28] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 1.70, 0.51, 0.18
[07:49:15] Operations, DNS, Mail, Traffic: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230#3876973 (grin) Whoever uses it should be covered by the SPF anyway, that's the point. wikimedia.org. 597 IN TXT "v=spf1 ip4:91.198.174.0/24 ip4:208.80.152.0/22 ip6:...
[07:53:31] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/402187
[07:53:40] (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/402187 (owner: Marostegui)
[07:54:31] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/402187 (owner: Marostegui)
[07:55:28] (PS1) Marostegui: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402188 (https://phabricator.wikimedia.org/T174569)
[08:14:02] Operations, Continuous-Integration-Config: tox 2.5.0 on phabricator-jessie-diffs fails with ERROR: Commands not specified - https://phabricator.wikimedia.org/T184060#3877476 (hashar) The revert commit for 2.7.0 https://github.com/tox-dev/tox/issues/454 which looks like a hack when one can achieve exactly...
[08:18:16] (CR) Hashar: "Danke Schon!"
[puppet] - https://gerrit.wikimedia.org/r/394555 (https://phabricator.wikimedia.org/T181799) (owner: Jdrewniak)
[08:36:27] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402188 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:37:57] (Merged) jenkins-bot: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402188 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:39:08] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 - T163190 (duration: 00m 28s)
[08:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:23] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[08:43:18] (CR) Gergő Tisza: "Are you sure? When I test this rule on vagrant, it creates a crontab entry with 42 2 * * * as expected, and the puppet source co" [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[08:51:37] (CR) jenkins-bot: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/402188 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[08:54:32] !log ran git checkout modules/role/manifests/puppetmaster/standalone.pp on labs-puppetmaster.wikimedia.org to unblock sync from prod
[08:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:43] (PS5) ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - https://gerrit.wikimedia.org/r/394977
[08:54:56] cc: andrewbogott, madhuvishy, arturo --^
[08:55:11] Operations, Continuous-Integration-Config: tox 2.5.0 on phabricator-jessie-diffs fails with ERROR: Commands not specified - https://phabricator.wikimedia.org/T184060#3877497 (fgiunchedi) Open>Invalid Fair enough! Thanks @hashar !
[09:11:46] !log reboot ms-be1014 to test update stretch kernel
[09:11:51] (PS6) Smalyshev: Add loading DCAT-AP data into dcatap namespace on WDQS [puppet] - https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978)
[09:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:05] (CR) Legoktm: [WIP] php7 manifests for mediawiki on stretch (2 comments) [puppet] - https://gerrit.wikimedia.org/r/394977 (owner: ArielGlenn)
[09:17:41] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3877516 (mobrovac)
[09:17:44] Puppet, Beta-Cluster-Infrastructure, Services (done): Puppet broken on deployment-trending01 due to removal of role - https://phabricator.wikimedia.org/T184241#3877513 (mobrovac) Open>Resolved The instance has been deleted and its puppet prefix and web proxy cleaned up.
[09:20:24] (CR) Alexandros Kosiaris: [C: -1] "Minor comment inline, rest LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/402119 (owner: BryanDavis)
[09:20:52] !log drain and reboot analytics1030 for kernel updates
[09:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:55] (PS6) ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - https://gerrit.wikimedia.org/r/394977
[09:23:18] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/402314
[09:23:26] (CR) jerkins-bot: [V: -1] Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/402314 (owner: Marostegui)
[09:23:40] come on...
[09:24:17] (Abandoned) Marostegui: Revert "db-eqiad.php: Depool db1098:3317" [mediawiki-config] - https://gerrit.wikimedia.org/r/402314 (owner: Marostegui)
[09:24:24] (PS6) Marostegui: db-eqiad.php: Point wikidatawiki to s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/401436 (https://phabricator.wikimedia.org/T177208)
[09:24:37] (PS4) Marostegui: db-eqiad.php: Set s5 on read_only [mediawiki-config] - https://gerrit.wikimedia.org/r/401434 (https://phabricator.wikimedia.org/T177208)
[09:25:54] (PS1) Marostegui: db-eqiad.php: Repool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402315 (https://phabricator.wikimedia.org/T163190)
[09:27:57] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402315 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:29:39] (Merged) jenkins-bot: db-eqiad.php: Repool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402315 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:29:51] (CR) jenkins-bot: db-eqiad.php: Repool db1098:3317 [mediawiki-config] - https://gerrit.wikimedia.org/r/402315 (https://phabricator.wikimedia.org/T163190) (owner: Marostegui)
[09:29:57] (PS1) Elukey: role::analytics_cluster::hadoop::master: fix order in profile inclusion [puppet] - https://gerrit.wikimedia.org/r/402316 (https://phabricator.wikimedia.org/T167790)
[09:30:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1098:3317 - T163190 (duration: 00m 27s)
[09:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:46] T163190: Checksum data on s7 - https://phabricator.wikimedia.org/T163190
[09:38:55] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9588/" [puppet] - https://gerrit.wikimedia.org/r/402316 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[09:47:10] (CR) Jcrespo: "I can
deploy as is, but if it creates it with dayoftheweek => 0, I will revert it." [puppet] - https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: Gergő Tisza)
[09:55:25] Operations, Goal, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3877577 (fgiunchedi)
[09:55:27] Operations, Goal, Patch-For-Review, User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3877575 (fgiunchedi) Open>Resolved Indeed, so the problem is that we were trying to source `/etc/default/prometheus-nutcracker-exporter` file which was...
[09:57:14] (CR) TerraCodes: [C: 1] Turn on mapframe for Arabic Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: Jayprakash12345)
[10:03:41] (CR) Alexandros Kosiaris: [C: 1] network::constants: add fake CACHE_MISC for labs [puppet] - https://gerrit.wikimedia.org/r/402136 (owner: Dzahn)
[10:09:41] (PS1) Filippo Giunchedi: mtail: group invalid methods under a single metric [puppet] - https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926)
[10:14:05] !log reboot labsdb1009
[10:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:00] !log reboot restbase2004 to test kernel upgrade
[10:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:45] akosiaris testing
[10:16:07] wow ircecho sucks....
[10:16:39] akosiaris testing
[10:16:39] akosiaris testing
[10:17:13] hmm connection forcefully closed, messages not arriving, and it doesn't log a single thing
[10:18:22] sigh
[10:18:29] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[10:19:13] see log
[10:23:30] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0
[10:27:33] (PS1) Jcrespo: dbproxy: Switchover analytics-labsdb to labsdb1009 [puppet] - https://gerrit.wikimedia.org/r/402320
[10:27:49] (PS7) Gehel: Add loading DCAT-AP data into dcatap namespace on WDQS [puppet] - https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978) (owner: Smalyshev)
[10:28:56] (CR) Gehel: [C: 2] Add loading DCAT-AP data into dcatap namespace on WDQS [puppet] - https://gerrit.wikimedia.org/r/399954 (https://phabricator.wikimedia.org/T178978) (owner: Smalyshev)
[10:31:11] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:31:52] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:01] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:32:01] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:33:51] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:11] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:12] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[10:34:31] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:31] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:41] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:41] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:41] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:35:36] the failures don't seem to be related to my change... but checking...
[10:36:10] puppetdb restarted on nitrogen 8m ago
[10:36:37] OOMkill
[10:36:42] elukey: thanks!
[10:38:51] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:50:13] #0 0x00007fd89967b010 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
[10:50:16] how nice
[10:50:30] even ircecho somehow managed to deadlock itself waiting on some semaphore
[10:50:37] threads are hard ...
[10:54:07] (PS1) ArielGlenn: snapshot hosts: empty nfs server name means no mount [puppet] - https://gerrit.wikimedia.org/r/402322
[10:54:44] (PS1) Elukey: profile::hadoop::*: include labs firewall use case [puppet] - https://gerrit.wikimedia.org/r/402323 (https://phabricator.wikimedia.org/T167790)
[10:56:50] (CR) ArielGlenn: [C: 2] snapshot hosts: empty nfs server name means no mount [puppet] - https://gerrit.wikimedia.org/r/402322 (owner: ArielGlenn)
[10:59:06] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler03/9590/" [puppet] - https://gerrit.wikimedia.org/r/402323 (https://phabricator.wikimedia.org/T167790) (owner: Elukey)
[10:59:11] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:12] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[10:59:31] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:31] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:41] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:41] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:59:41] RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[11:01:11] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:01:52] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:01] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:02:01] RECOVERY - puppet last run on
contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:03:48] (03CR) 10Muehlenhoff: profile::hadoop::*: include labs firewall use case (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402323 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [11:04:51] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-kafka-jump-[12] due to version of a package being missing - https://phabricator.wikimedia.org/T184240#3877141 (10Paladox) Probably want to include the os too like Jessie or stretch? [11:07:08] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3877128 (10Paladox) Maybe stretch is pointing to an o... [11:10:50] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#3877079 (10Paladox) I guess we can apply this https://github.com/wikimedia/mediawiki-vagrant/commit/ac6d19df598c75d97b635b026763ae7fd96f5970 fix at /... 
[11:13:03] (03Abandoned) 10Elukey: profile::hadoop::*: include labs firewall use case [puppet] - 10https://gerrit.wikimedia.org/r/402323 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [11:15:28] (03PS2) 10Ema: mtail: group invalid methods under a single metric [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [11:17:00] (03PS1) 10Elukey: network::constants: add fake analytics networks for labs [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) [11:21:42] (03PS3) 10Ema: mtail: group invalid methods under a single metric [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [11:26:19] (03PS1) 10Filippo Giunchedi: Run mtail tests via tox->rake->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) [11:26:39] (03PS2) 10Filippo Giunchedi: Run mtail tests via rake->tox->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) [11:27:01] (03PS2) 10Jcrespo: dbproxy: Switchover analytics-labsdb to labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/402320 [11:27:27] (03CR) 10Jcrespo: [C: 032] dbproxy: Switchover analytics-labsdb to labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/402320 (owner: 10Jcrespo) [11:31:50] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3877801 (10Qgil) Thanks! One thing is to fiddle in your own server and another thing is to do the same in a Cloud instance with other admins... 
[11:33:09] (03CR) 10Giuseppe Lavagetto: [C: 031] Run mtail tests via rake->tox->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [11:34:55] (03CR) 10Ema: [C: 031] Run mtail tests via rake->tox->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [11:35:49] (03PS3) 10Filippo Giunchedi: Run mtail tests via rake->tox->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) [11:37:00] (03CR) 10Filippo Giunchedi: [C: 032] Run mtail tests via rake->tox->nose [puppet] - 10https://gerrit.wikimedia.org/r/402325 (https://phabricator.wikimedia.org/T181794) (owner: 10Filippo Giunchedi) [11:37:22] jynus: merging your patch too [11:37:29] thanks [11:37:48] np [11:38:29] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [11:43:14] (03PS2) 10Elukey: profile::hadoop:*: add ferm srange defaults to allow labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) [11:44:10] (03CR) 10Ema: [C: 031] mtail: group invalid methods under a single metric [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [11:45:11] (03CR) 10Filippo Giunchedi: [C: 032] mtail: group invalid methods under a single metric [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) (owner: 10Filippo Giunchedi) [11:45:16] (03PS4) 10Filippo Giunchedi: mtail: group invalid methods under a single metric [puppet] - 10https://gerrit.wikimedia.org/r/402318 (https://phabricator.wikimedia.org/T183926) [11:45:50] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - 
https://phabricator.wikimedia.org/T180854#3877831 (10Qgil) >>! In T180854#3861707, @Qgil wrote: >> debian-8.2-jessie (deprecated 2016-02-16) > > Should we start by upgrading the OS?... [11:48:25] (03PS3) 10Elukey: profile::hadoop:*: add ferm srange defaults to allow labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) [11:49:49] 10Operations, 10IRCecho, 10monitoring: ircecho doesn't reconnect on failure - https://phabricator.wikimedia.org/T184103#3877845 (10akosiaris) After some minor changes here and there I did a gdb on the thing and after forcefully closing the TCP connectio to the IRC server we get ``` (gdb) bt #0 0x00007fd899... [11:51:09] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/9592/" [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [11:55:38] (03PS1) 10Jcrespo: dbproxy: Switchover labsdb1009 to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/402327 [11:57:17] (03CR) 10Jcrespo: [C: 032] dbproxy: Switchover labsdb1009 to labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/402327 (owner: 10Jcrespo) [12:03:08] !log reboot cp1008 into linux 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:12] !log upgrade and restart labsdb1011 [12:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:35] (03PS1) 10Filippo Giunchedi: prometheus: aggregate varnish_requests rate [puppet] - 10https://gerrit.wikimedia.org/r/402328 (https://phabricator.wikimedia.org/T177199) [12:08:41] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [12:10:10] see lok [12:10:14] *see log [12:14:41] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [12:16:24] (03PS1) 10Jcrespo: Revert "dbproxy: Switchover labsdb1011 to labsdb1010" [puppet] - 
10https://gerrit.wikimedia.org/r/402339 [12:20:23] (03CR) 10Jcrespo: [C: 032] Revert "dbproxy: Switchover labsdb1011 to labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/402339 (owner: 10Jcrespo) [12:30:15] (03PS1) 10Ema: varnishmtail: specify reload action [puppet] - 10https://gerrit.wikimedia.org/r/402342 (https://phabricator.wikimedia.org/T177199) [12:40:25] (03PS4) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) [12:40:27] (03PS2) 10Alexandros Kosiaris: ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) [12:40:29] (03PS1) 10Alexandros Kosiaris: ircecho: Normalize print statements [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) [12:40:31] (03PS1) 10Alexandros Kosiaris: ircecho: set EchoNotifier threads as daemon [puppet] - 10https://gerrit.wikimedia.org/r/402344 (https://phabricator.wikimedia.org/T184103) [12:40:49] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [12:40:54] volans|off: finally solved it (I think) https://gerrit.wikimedia.org/r/#/q/topic:ircecho_cleanups+(status:open+OR+status:merged) [12:40:58] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [12:41:05] (03CR) 10jerkins-bot: [V: 04-1] ircecho: Normalize print statements [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) (owner: 10Alexandros Kosiaris) [12:43:06] !log upgrade cp3007 to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [12:43:08] 10Operations, 10IRCecho, 
10monitoring, 10Patch-For-Review: ircecho doesn't reconnect on failure - https://phabricator.wikimedia.org/T184103#3877984 (10akosiaris) After some experimentation it looks like the main thread is just waiting for the other threads to terminate. This can never happen in normal cond... [12:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:18] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [12:44:08] (03PS2) 10Giuseppe Lavagetto: labtestservices2001: one role() call [puppet] - 10https://gerrit.wikimedia.org/r/401555 [12:44:10] (03PS1) 10Giuseppe Lavagetto: wmflib: simplify the role() function, convert to the new API [puppet] - 10https://gerrit.wikimedia.org/r/402345 [12:44:12] (03PS1) 10Giuseppe Lavagetto: hiera: port nuyaml to hiera 3 [puppet] - 10https://gerrit.wikimedia.org/r/402346 [12:44:14] (03PS1) 10Giuseppe Lavagetto: hiera: first step of simplification [puppet] - 10https://gerrit.wikimedia.org/r/402347 [12:44:48] !log reboot kafka-jumbo1001 for kernel updates [12:44:48] (03CR) 10jerkins-bot: [V: 04-1] hiera: first step of simplification [puppet] - 10https://gerrit.wikimedia.org/r/402347 (owner: 10Giuseppe Lavagetto) [12:44:53] <_joe_> akosiaris: I'd like to hear your opinion on ^^ [12:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:27] (03CR) 10Elukey: [C: 032] profile::hadoop:*: add ferm srange defaults to allow labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [12:47:32] (03PS4) 10Elukey: profile::hadoop:*: add ferm srange defaults to allow labs deployments [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) [12:47:39] _joe_: the new httpd module looks really cool btw [12:48:25] <_joe_> akosiaris: I've seem mutante merged it and ported quite a few things to it [12:49:01] <_joe_> which is very cool :) [12:49:12] 
<_joe_> tbh, we should really embrace puppet 4 [12:49:29] <_joe_> at least in terms of "typing" class/define signatures [12:50:01] (03PS1) 10Alexandros Kosiaris: httpd: Fix long line in rspec [puppet] - 10https://gerrit.wikimedia.org/r/402350 [12:50:08] personally I am thinking about enabling strict_variables soon [12:50:34] looks like it's not that difficult ... the issues are fewer than 50 [12:50:35] <_joe_> that could be a problem with hiera and role() [12:50:45] <_joe_> that's the hard one :) [12:50:50] yeah I don't know about role() [12:50:58] but we already have warnings [12:51:12] so if we fix all these plus enable it in pcc [12:51:23] <_joe_> akosiaris: I want to move on with data_binding_terminus = none at least in a specific environment [12:51:37] (03CR) 10Alexandros Kosiaris: [C: 032] httpd: Fix long line in rspec [puppet] - 10https://gerrit.wikimedia.org/r/402350 (owner: 10Alexandros Kosiaris) [12:52:18] (03PS3) 10Giuseppe Lavagetto: labtestservices2001: one role() call [puppet] - 10https://gerrit.wikimedia.org/r/401555 [12:52:59] it's deprecated as a setting btw [12:53:05] https://tickets.puppetlabs.com/browse/PUP-6576 [12:56:58] <_joe_> akosiaris: have you noticed how the puppet dev didn't get what the user talking about data_binding_terminus: none was saying [12:57:13] lol [12:57:36] <_joe_> "I know we have a few folks that are setting data_binding_terminus to none in order to disable automatic parameter lookup. What changes will they have to make in order to keep lookup disabled at the global and environment levels once this change lands?" [12:57:44] <_joe_> "Thanks Charlie, 'none' is a better option than blank/nil. Will change in the impl I am working on right now. [12:57:47] <_joe_> To not get any env lookup, simply do not have a lookup.yaml in the env, or use one with an empty hierarchy. [12:57:50] <_joe_> " [12:57:52] <_joe_> wtf? 
[12:58:24] <_joe_> but then the commit does the right thing [12:58:25] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3878008 (10Paladox) According to https://www.neowin.net/news/ubuntu-will-fix-meltdown-and-spectre-by-january-9th Ubuntu plans to release a fix... [12:58:28] <_joe_> lol [12:59:38] (03PS7) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [13:04:09] <_joe_> apergos: should I take a look or wait? [13:05:19] (03CR) 10Giuseppe Lavagetto: [C: 032] labtestservices2001: one role() call [puppet] - 10https://gerrit.wikimedia.org/r/401555 (owner: 10Giuseppe Lavagetto) [13:05:34] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [13:06:19] _joe_: it's pretty soon [13:06:32] we don't have some extensions built yet even for php7, it seems [13:06:44] it's ok for me to start testing with for dumps but that's about it [13:07:31] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3877039 (10Paladox) This is probaly because puppet has been broken on this host for a long while now. Probaly needs to be recreated or deleted. It’s been disconnected from getting any changes... [13:07:54] <_joe_> no I meant to that specific puppet patch [13:08:10] <_joe_> please ping me for a review when it's ready [13:08:21] <_joe_> :) [13:08:37] <_joe_> and yes, we might need to rewrite parts of luasandbox et al. [13:09:31] (03PS1) 10Ema: varnishmtail: notify daemons upon mtail program modification [puppet] - 10https://gerrit.wikimedia.org/r/402353 (https://phabricator.wikimedia.org/T177199) [13:09:40] akosiaris: thanks for taking care of it, I can review them on Mon. 
if not yet merged by then [13:10:41] yes I too meant the specific patch [13:10:55] I will ping when it's a bit closer to something :-) [13:12:32] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [13:14:19] (03PS1) 10Elukey: profile::hadoop::firewall::master: fix default ferm srange [puppet] - 10https://gerrit.wikimedia.org/r/402354 (https://phabricator.wikimedia.org/T166248) [13:14:55] (03CR) 10Elukey: [C: 032] profile::hadoop::firewall::master: fix default ferm srange [puppet] - 10https://gerrit.wikimedia.org/r/402354 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [13:19:40] !log deploying Analytics Query Service [13:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:53] 10Operations: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3878085 (10ArielGlenn) p:05Triage>03Normal [13:22:21] !log rebooting elastic1017 for kernel upgrade [13:22:24] (03CR) 10Filippo Giunchedi: "PCC broken due to missing vk certs? https://puppet-compiler.wmflabs.org/compiler02/9593/cp1008.wikimedia.org/change.cp1008.wikimedia.org.e" [puppet] - 10https://gerrit.wikimedia.org/r/402353 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [13:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:49] !log fdans@tin Started deploy [analytics/aqs/deploy@792c95d]: (no justification provided) [13:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:13] 10Operations: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3878103 (10MoritzMuehlenhoff) @Legoktm already prepared a stretch-backports upload of php-luasandbox, so we can use that one. We could update wikidiff2 in stretch-backports to 1.5.1-3 and stick... 
[13:24:21] !log fdans@tin Finished deploy [analytics/aqs/deploy@792c95d]: (no justification provided) (duration: 01m 32s) [13:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:49] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3878124 (10fgiunchedi) [13:25:51] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Limit http methods reported by varnishmtail - https://phabricator.wikimedia.org/T183926#3878122 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [13:26:41] 10Operations: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3878131 (10ArielGlenn) Works for me, but the actual users of these packages should probably weigh in ;-) [13:32:47] (03PS2) 10Filippo Giunchedi: prometheus: aggregate varnish_requests rate [puppet] - 10https://gerrit.wikimedia.org/r/402328 (https://phabricator.wikimedia.org/T177199) [13:33:45] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: aggregate varnish_requests rate [puppet] - 10https://gerrit.wikimedia.org/r/402328 (https://phabricator.wikimedia.org/T177199) (owner: 10Filippo Giunchedi) [13:37:27] !log rebooting wdqs1003 for kernel upgrade [13:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:02] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3878207 (10fgiunchedi) Status update: * varnishstats has been replaced with varnishmtail-backend to get a breakdown of stat... 
[13:49:17] (03CR) 10Ottomata: profile::hadoop:*: add ferm srange defaults to allow labs deployments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402324 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [13:52:49] !log fdans@tin Started deploy [analytics/aqs/deploy@792c95d]: (no justification provided) [13:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:07] !log fdans@tin Finished deploy [analytics/aqs/deploy@792c95d]: (no justification provided) (duration: 00m 18s) [13:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#3878217 (10chasemp) These seem pretty close, any chance they are on the agenda for early next week? We are looking at a potential resource crunch for CPU and these would be heartwarmi... [13:54:29] !log upgrade cp3046 to latest jessie point release (8.10) T182656 and linux kernel 4.9.65-3+deb9u1~bpo8+2 (KPTI) T184267 [13:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:39] T182656: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656 [13:56:33] !log elukey@tin Started deploy [analytics/aqs/deploy@792c95d]: Add pageviews by country endpoint [13:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:46] !log elukey@tin Finished deploy [analytics/aqs/deploy@792c95d]: Add pageviews by country endpoint (duration: 01m 12s) [13:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:38] (03PS1) 10Andrew Bogott: nova scheduler pool: Add some comments so I remember which hosts are for infra [puppet] - 10https://gerrit.wikimedia.org/r/402356 [14:03:47] !log fdans@tin (no justification provided) [14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:30] ottomata, 
elukey: pcc is apparently broken on cp1008 due to kafka::webrequest::jumbo https://puppet-compiler.wmflabs.org/compiler03/9596/cp1008.wikimedia.org/change.cp1008.wikimedia.org.err [14:04:39] puppet runs fine on the host though [14:05:08] ema: ah it's my fault! haven't updated the labs private repo! [14:05:17] fixing it [14:05:24] elukey: thanks :) [14:07:30] (03PS1) 10Elukey: Move varnishkafka.key.pem to varnishkafka.key.private.pem [labs/private] - 10https://gerrit.wikimedia.org/r/402357 [14:07:46] (03CR) 10Elukey: [V: 032 C: 032] Move varnishkafka.key.pem to varnishkafka.key.private.pem [labs/private] - 10https://gerrit.wikimedia.org/r/402357 (owner: 10Elukey) [14:08:12] ema: feel free to retry, it should work [14:09:30] (03CR) 10Filippo Giunchedi: [C: 031] varnishmtail: notify daemons upon mtail program modification [puppet] - 10https://gerrit.wikimedia.org/r/402353 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:13:27] elukey: confirmed <3 [14:13:48] <# [14:13:50] <3 [14:16:16] (03Abandoned) 10Ema: varnishmtail: specify reload action [puppet] - 10https://gerrit.wikimedia.org/r/402342 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:17:53] !log reboot maps1002 for kernel upgrade [14:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:54] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1002.eqiad.wmnet [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:52] (03CR) 10Giuseppe Lavagetto: [C: 031] mediawiki-maintenance: Run maintenance on new s8 replica set, too [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [14:21:55] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Cloud: Labvirt and instance reboots for Meltdown - https://phabricator.wikimedia.org/T184189#3878325 (10chasemp) > OS_TENANT_NAME=testlabs openstack server create --flavor 2 
--image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-z... [14:22:06] (03PS2) 10Ema: varnishmtail: notify daemons upon mtail program modification [puppet] - 10https://gerrit.wikimedia.org/r/402353 (https://phabricator.wikimedia.org/T177199) [14:22:10] (03CR) 10Ema: [V: 032 C: 032] varnishmtail: notify daemons upon mtail program modification [puppet] - 10https://gerrit.wikimedia.org/r/402353 (https://phabricator.wikimedia.org/T177199) (owner: 10Ema) [14:25:13] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1002.eqiad.wmnet [14:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] :) [14:29:54] (03PS2) 10Jcrespo: mediawiki-maintenance: Run maintenance on new s8 replica set, too [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) [14:29:57] Getting an odd caching issue with - https://species.wikimedia.org/w/index.php?title=Special:LintErrors/missing-end-tag&dir=prev&namespace=0 [14:30:02] I can click to reload it and something that should have updated hasn't [14:30:25] In that the results still reflect the status of a page BEFORE it was edited. [14:30:30] (03CR) 10Jcrespo: [C: 032] mediawiki-maintenance: Run maintenance on new s8 replica set, too [puppet] - 10https://gerrit.wikimedia.org/r/402047 (https://phabricator.wikimedia.org/T184179) (owner: 10Jcrespo) [14:32:40] 10Operations, 10ops-eqiad, 10DC-Ops: cp1066's DRAC not responding to SSH - https://phabricator.wikimedia.org/T184196#3878359 (10ema) a:03Cmjohnson [14:33:58] I'm also finding I have to save pages TWICE to get things to update [14:35:31] ShakespeareFan02: Special Pages dont always update instantly, this is due to caching and iirc normal [14:35:48] ShakespeareFan02 are you sure you need to save it twice, or just wait a few seconds? [14:36:00] It would be nice to have some indication of lag times? 
[14:36:02] it is relatively normal that some jobs are delayed a bit [14:36:16] otherwise, saving would take minutes [14:37:04] e.g. categorization of pages, a common complaint, is not instant [14:37:08] Normally it takes about 30 secs for the page linked to update [14:37:18] Today it's taking over an hour [14:37:42] ShakespeareFan02: let's look at the job queue delay, it is public [14:37:50] but it updates almost instantly for specific entries if they are saved twice with null-saves [14:38:17] Lint-Errors is quite a big task anyway [14:38:20] :( [14:38:26] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1 [14:39:27] it seems things are quite busy at the moment with 10 million jobs queued [14:40:28] for linter (I assume it is that one) it seems to take an average of 20 seconds to complete: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&panelId=7&fullscreen&var-jobType=RecordLintJob [14:41:26] I saw some query errors about linter statistics in the past, ShakespeareFan02, does that page provide such statistics? [14:41:27] That doesn't explain why stuff I edited at 12:41 is still showing up at 14:41 [14:41:39] Saving Twice should NOT be needed [14:41:52] I wanted to report that it seems that many linter SQL queries fail [14:42:07] and it could be that a second time they work because of warm buffers [14:42:15] This is not acceptable [14:42:21] I would report such a thing to the maintainer [14:42:43] Ideally, you shouldn't have to Save twice [14:42:52] To get what should be a routine update to do [14:42:54] (sigh) [14:42:56] a report will likely get a fix by a person that is in charge of that [14:43:11] complaining here probably will not :-D [14:43:25] Who's the maintainer of that special page then? [14:43:26] if you create a bug report [14:43:53] I can add the things I observed [14:43:57] In other words... Unless I am prepared to do my own debugging... nothing changes... 
[14:44:00] bye [14:45:09] not sure what I take from this conversation, I was just trying to help [14:55:53] I have created https://phabricator.wikimedia.org/T184280 [14:59:11] <_joe_> jynus: meh, I wouldn't have bothered, after such an interaction, tbg [14:59:14] <_joe_> *tbh [14:59:40] <_joe_> that is absolutely unacceptable and people need to learn to behave with each other with civility [14:59:47] <_joe_> esp since you were trying to help [14:59:59] yes, but if there is someone that has to be reasonable, it is us/me [15:00:01] <_joe_> also, this is not a helpdesk [15:00:24] I have made a new year's resolution to try to be friendlier [15:00:29] <_joe_> and you were [15:00:31] <_joe_> :) [15:01:04] also looking from other people's perspective - probably the person was frustrated [15:01:30] but maybe also afraid of technical stuff? We have to give the benefit of the doubt of good intentions [15:01:56] <_joe_> no I'm sorry the way the conversation closed was simply unacceptable [15:02:37] <_joe_> even with all the benefits, that might no longer apply to that specific person, btw [15:02:52] everybody has a bad day... 
or two [15:03:02] :-) [15:07:19] (03PS5) 10Alexandros Kosiaris: ircecho: Remove redundant thread [puppet] - 10https://gerrit.wikimedia.org/r/402081 (https://phabricator.wikimedia.org/T184103) [15:07:21] (03PS3) 10Alexandros Kosiaris: ircecho: Force unbuffered stdin/stdout/stderr [puppet] - 10https://gerrit.wikimedia.org/r/402101 (https://phabricator.wikimedia.org/T184103) [15:07:23] (03PS2) 10Alexandros Kosiaris: ircecho: Normalize print statements [puppet] - 10https://gerrit.wikimedia.org/r/402343 (https://phabricator.wikimedia.org/T184103) [15:07:26] (03PS2) 10Alexandros Kosiaris: ircecho: set EchoNotifier threads as daemon [puppet] - 10https://gerrit.wikimedia.org/r/402344 (https://phabricator.wikimedia.org/T184103) [15:10:08] (03PS3) 10Alexandros Kosiaris: admin: Flatten 2 levels of arrays in unique_users [puppet] - 10https://gerrit.wikimedia.org/r/401701 [15:10:15] (03CR) 10Alexandros Kosiaris: [C: 032] admin: Flatten 2 levels of arrays in unique_users [puppet] - 10https://gerrit.wikimedia.org/r/401701 (owner: 10Alexandros Kosiaris) [15:12:06] (03PS1) 10Cmjohnson: Adding production dns labvirt1021/22 [dns] - 10https://gerrit.wikimedia.org/r/402361 (https://phabricator.wikimedia.org/T183937) [15:21:55] (03PS1) 10Ottomata: Create cdh::zookeeper class and specify version [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402363 [15:22:13] (03CR) 10jerkins-bot: [V: 04-1] Create cdh::zookeeper class and specify version [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402363 (owner: 10Ottomata) [15:22:17] elukey: a little funky, but I think ^ will do it [15:22:39] (03PS2) 10Ottomata: Create cdh::zookeeper class and specify version [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402363 [15:27:52] ottomata: I am wondering if we could avoid to specify a zk version in the cdh module, and do something like apt::pin in the various profiles [15:28:02] or maybe in a profile that gets included [15:29:00] ah but it is used in other cdh classes uff [15:29:23] maybe we can have a 
cdh class parameter for the version, and use a profile to configure it? [15:29:24] yeah, i mean [15:29:27] we could set the default to present [15:29:37] and then somehow provide the version via hiera [15:29:40] but then we'd have to pass it down [15:29:44] if we had cdh module hiera... [15:29:48] heheheh it'd be easy [15:29:59] common/cdh/zookeeper.yaml: [15:30:00] ensure: '...' [15:30:01] buut nawww [15:30:26] elukey: agree it would be better elsewhere, but i think i want to take a pass at doing some refactors of this stuff too...something is weird [15:30:36] standby, master, worker have some commonalities [15:30:44] but we are separating and duplicating some things [15:30:47] i think there is a way...not sure though [15:30:51] but it would take a while to figure out [15:31:51] elukey: if we could figure out how to make an apt::pin for cdh without a speciifc zookeeper version [15:31:53] that could be nice [15:32:07] just a more generic [15:32:30] profile::hadoop::apt_pin class that could be included where we need [15:32:42] dunno if that's possible though, since the package verisons vary between packages [15:32:51] we just need the 'cdh' part of the version [15:33:28] I think that apt::pin does what we need, like [15:33:29] apt::pin {'reprepro': [15:33:29] pin => 'release a=jessie-backports', [15:33:29] priority => '1001', [15:33:29] before => Package['reprepro'], [15:33:31] } [15:33:46] (03CR) 10Debt: [C: 031] "Looks good!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/400682 (https://phabricator.wikimedia.org/T183764) (owner: 10Jayprakash12345) [15:36:30] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3878468 (10Marostegui) [15:36:30] hmmm [15:37:38] (03PS5) 10Alexandros Kosiaris: Add all ops members to docker group [puppet] - 10https://gerrit.wikimedia.org/r/401492 [15:37:40] (03PS1) 10Alexandros Kosiaris: servermon: Add HOST_MAX_INACTIVE_DAYS setting [puppet] - 10https://gerrit.wikimedia.org/r/402366 [15:39:04] (03PS2) 10Alexandros Kosiaris: servermon: Add HOST_MAX_INACTIVE_DAYS setting [puppet] - 10https://gerrit.wikimedia.org/r/402366 [15:41:51] !log Upgrade db2072 (mariadb and kernel) - T184256 [15:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:41] elukey: i like it, i think we can do that [15:43:56] i'd prefer to put the apt::pin into the cdh module, buuut then we'd be including a module from another [15:44:00] i guess i'll do a profile [15:44:25] and then we can force the apt::pin before the rest [15:44:25] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Add HOST_MAX_INACTIVE_DAYS setting [puppet] - 10https://gerrit.wikimedia.org/r/402366 (owner: 10Alexandros Kosiaris) [15:44:40] so the ensure should do the work as intended [15:44:45] without specifying any version [15:44:47] before => Class['cdh::hadoop'], [15:44:48] should do it [15:45:04] ack, do you want to update the code review with an attempt? [15:45:59] doing ... [15:46:00] ya [15:48:12] !log rebooting multatuli for kernel update [15:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:21] (03PS1) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 [15:49:27] elukey: ^? 
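The apt::pin idea discussed above can be illustrated with a small sketch. It assumes that apt::pin ends up writing a standard apt preferences stanza (Package / Pin / Pin-Priority), and that the package name defaults to the resource title; both are assumptions, not something stated in the channel. The helper names `render_apt_pin` and `allows_downgrade` are hypothetical. The values are the ones from the reprepro example quoted in the chat.

```python
def render_apt_pin(package, pin, priority):
    """Render one apt preferences stanza as text.

    Hypothetical helper: it mirrors the three fields that a pin such as
    the reprepro example from the chat would need to express.
    """
    lines = [
        f"Package: {package}",
        f"Pin: {pin}",
        f"Pin-Priority: {priority}",
    ]
    return "\n".join(lines) + "\n"


def allows_downgrade(priority):
    """Per apt_preferences(5), a priority of 1000 or more lets apt install
    the pinned version even if that means downgrading the package."""
    return priority >= 1000


if __name__ == "__main__":
    # Values taken from the reprepro pin quoted in the conversation.
    print(render_apt_pin("reprepro", "release a=jessie-backports", 1001), end="")
```

A priority in this range is what makes a pin "win" over a newer version from another repository, which is why the pin also has to be applied before the packages are installed (the `before =>` ordering mentioned in the chat).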
[15:50:54] hmm might need higher priority [15:50:57] !log Upgrade db2071 kernel - T184256 [15:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:52] (03PS2) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 [15:52:01] ottomata: yeah I was about to say that, like 1002 [15:52:20] also, would it be ok to have require profile::hadoop::apt_pin rather than the before => etc.. ? [15:52:33] hmmm [15:53:50] was 2055 the one that had partially failed, or was it another? [15:53:55] *db2055 [15:54:09] i think i like the before better here, because then the class itself can insert the dependency. i think i'd be fine with depending on the profile::hadoop::common if you like, or declaring the dependency in the class, i.e. Class['profile::hadoop::common'] -> Class['profile::hadoop::apt_pin'] [15:54:26] elukey: i'm testing this pin manually on your hadoop-master-1 [15:54:29] jynus: what do you mean partially? [15:54:48] there was a predictive failure somewhere on codfw [15:54:51] ah [15:54:53] db2054 [15:55:00] which was already fixed :) [15:55:01] so different host, almost the same [15:55:16] now is db2055 with another disk [15:55:29] predictive failure again? [15:55:35] not this time [15:56:05] those disk failures take a long time on soft [15:56:10] because performance reasons [15:56:23] hmmm not sure if it is working [15:56:54] probably the detection doesn't work until it is hard [15:57:08] yeah, it is already showing failed on hpssacli [15:57:12] ottomata: I prefer the profile::hadoop::common solution but I'll let you decide [15:57:14] oh elukey, we already have a profile::cdh::apt :) [15:57:26] HMMM but [15:57:26] hmmm [15:57:39] db2071 is you, marostegui? [15:58:10] yeah, see SAL :) [15:58:23] oh, I just missed that [15:58:34] did you upgrade to 10.1.30 too? 
[15:58:42] I actually double checked, you made me doubt if I logged it [15:58:46] (03PS1) 10Alexandros Kosiaris: servermon: Remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/402371 [15:58:50] Yeah, db2072 yes, kernel+10.1.30 [15:58:54] cool [15:59:00] (03PS1) 10Alexandros Kosiaris: servermon: Use the new WSGI invocation pattern [puppet] - 10https://gerrit.wikimedia.org/r/402372 [15:59:03] openssl I think too [15:59:04] db2071 only kernel - already running 10.0.33 [15:59:15] yeah, I did a full apt full-upgrade :) [15:59:16] what? [15:59:39] I do not undertand, isn't that stretch [15:59:45] db2072 yes, db2071 no [15:59:52] I got confused [15:59:57] :-( [16:00:08] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Remove tests directory [puppet] - 10https://gerrit.wikimedia.org/r/402371 (owner: 10Alexandros Kosiaris) [16:00:08] hehe, I touched two servers db2072 and db2071 [16:00:11] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Use the new WSGI invocation pattern [puppet] - 10https://gerrit.wikimedia.org/r/402372 (owner: 10Alexandros Kosiaris) [16:00:17] "db2071 only kernel - already running 10.0.33", so that was db2072? [16:00:46] jynus: no, db2072 (kernel+10.1.30), db2071 (kernel, it was already running the latest mariadb for it, 10.0.33) [16:01:16] ah, I get it now [16:01:28] :-) [16:02:00] db2071, db2072; db2054, db2055 it is confusing [16:02:09] haha and on a friday evening! 
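[editor's sketch] On the earlier `require` versus `before =>` question for the apt pin: the two alternatives to a `before` metaparameter could be written roughly as below. This is only an illustration of standard Puppet ordering syntax; the class name `profile::hadoop::example_consumer` is hypothetical, and which variant the final patch used is not stated in the log.

```puppet
# Option A: the consuming class pulls the pin in itself.
# `require` both includes the class and orders it before this one.
class profile::hadoop::example_consumer {   # hypothetical class name
    require ::profile::hadoop::apt_pin
}

# Option B: a chaining arrow between two classes that are already
# declared in the catalog. The left-hand side is applied first.
Class['profile::hadoop::apt_pin'] -> Class['cdh::hadoop']
```

Note that with the arrow form the direction matters: whatever should be in place first (here, the pin) goes on the left.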
[16:04:40] !log akosiaris@tin Started deploy [servermon/servermon@3c8538a]: Update servermon to 3c8538a [16:04:48] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.367 second response time [16:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:03] !log akosiaris@tin Finished deploy [servermon/servermon@3c8538a]: Update servermon to 3c8538a (duration: 00m 23s) [16:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:42] (03PS3) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 [16:06:48] !log akosiaris@tin Started deploy [servermon/servermon@3c8538a]: Update servermon to 3c8538a [16:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:28] PROBLEM - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [16:07:30] ACKNOWLEDGEMENT - HP RAID on db2055 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T184285 [16:07:34] 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3878516 (10ops-monitoring-bot) [16:07:39] there we go [16:07:40] !log akosiaris@tin Started deploy [servermon/servermon@3c8538a]: Update servermon to 3c8538a [16:07:42] !log akosiaris@tin Finished deploy [servermon/servermon@3c8538a]: Update servermon to 3c8538a (duration: 00m 02s) [16:07:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:11] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3878520 (10Marostegui) a:03Papaul Can we get a new disk for this host? Thanks! [16:08:58] (03PS4) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 [16:19:05] (03Abandoned) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 (owner: 10Ottomata) [16:19:11] (03Restored) 10Ottomata: Create profile::hadoop::apt_pin to ensure zookeeper is the correct version [puppet] - 10https://gerrit.wikimedia.org/r/402370 (owner: 10Ottomata) [16:19:18] (03Abandoned) 10Ottomata: Create cdh::zookeeper class and specify version [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402363 (owner: 10Ottomata) [16:21:08] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/394966 (owner: 10Elukey) [16:24:17] (03PS1) 10Alexandros Kosiaris: Rename the servermon::wmf role back to servermon [puppet] - 10https://gerrit.wikimedia.org/r/402375 [16:24:38] (03CR) 10Alexandros Kosiaris: [C: 032] Rename the servermon::wmf role back to servermon [puppet] - 10https://gerrit.wikimedia.org/r/402375 (owner: 10Alexandros Kosiaris) [16:24:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rename the servermon::wmf role back to servermon [puppet] - 10https://gerrit.wikimedia.org/r/402375 (owner: 10Alexandros Kosiaris) [16:29:03] !log akosiaris@tin Started deploy [servermon/servermon@cf88f3f]: Update servermon to 3c8538a [16:29:05] !log akosiaris@tin Finished deploy [servermon/servermon@cf88f3f]: Update servermon to 3c8538a (duration: 00m 02s) [16:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:00] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3878562 (10herron) >>! In T181906#3877211, @Dzahn wrote: > But..i saw there is this file called "legacy_mailing_lists" and the exim config for those se... 
[16:36:08] (03PS1) 10Alexandros Kosiaris: servermon: Update settings.py.erb [puppet] - 10https://gerrit.wikimedia.org/r/402380 [16:36:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] servermon: Update settings.py.erb [puppet] - 10https://gerrit.wikimedia.org/r/402380 (owner: 10Alexandros Kosiaris) [16:37:24] (03PS1) 10Elukey: profile::analytics::database::meta::backup_dest: allow labs dir perms [puppet] - 10https://gerrit.wikimedia.org/r/402382 (https://phabricator.wikimedia.org/T166248) [16:40:22] !log upgrade and restart labsdb1010 [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:31] 2 proxies will complain temporarilly [16:42:48] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:43:29] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:51:48] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [16:52:28] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [17:01:08] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.233 second response time [17:04:36] (03PS1) 10Giuseppe Lavagetto: graphite: reorganize roles, one role() call per node [puppet] - 10https://gerrit.wikimedia.org/r/402388 [17:04:38] (03PS1) 10Giuseppe Lavagetto: role::installserver: create meta-role for installserver [puppet] - 10https://gerrit.wikimedia.org/r/402389 [17:04:40] (03PS1) 10Giuseppe Lavagetto: site.pp: one role() call for iron.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/402390 [17:05:16] (03CR) 10Ottomata: [C: 031] profile::analytics::database::meta::backup_dest: allow labs dir perms [puppet] - 10https://gerrit.wikimedia.org/r/402382 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [17:11:47] (03CR) 10Herron: [C: 031] "> AFAIK with this config v3 agents will silently ignore 
the http_" [puppet] - 10https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) (owner: 10Andrew Bogott) [17:13:58] 10Operations: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3878641 (10bd808) [17:14:01] 10Operations, 10MediaWiki-Vagrant, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review: php-luasandbox in Wikimedia's Stretch apt repo depends on php5 - https://phabricator.wikimedia.org/T183888#3878640 (10bd808) [17:29:24] 10Operations, 10Developer-Relations, 10cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org (pilot instance) - https://phabricator.wikimedia.org/T180854#3878718 (10bd808) >>! In T180854#3877831, @Qgil wrote: > > After reading https://wikitech.wikimedia.org/wiki/Distribution_upgrades/jessie_s... [17:42:57] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3878734 (10Natuur12) I'm one of the list admins . Those spamfilters are ancient and likely haven't been updated in a long, long time. I deleted all thre... [17:43:09] (03PS9) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: add targetted upgrades script [puppet] - 10https://gerrit.wikimedia.org/r/398079 (https://phabricator.wikimedia.org/T181647) [17:46:10] 10Operations, 10ops-eqiad, 10DC-Ops: cp1066's DRAC not responding to SSH - https://phabricator.wikimedia.org/T184196#3878740 (10Cmjohnson) 05Open>03Resolved This did not need to be powered off. I was able to reset mgmt via the idrac using the racadmin racreset command. I verified using an ipmi command... [17:59:19] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3878760 (10Papaul) Dear Mr Papaul Tshibamba, Hewlett Packard Enterprise Reference Number: 5325864400 STATUS: Customer Self Repair Part has been shipped Part/s shipped: 653952-001 Part description: SPS-DRV... 
[17:59:32] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3878761 (10Papaul) p:05High>03Normal [18:11:55] !log otto@tin Started deploy [analytics/superset/deploy@990bc38]: Running superset with python3 (fingers crossed) [18:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:05] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3878780 (10Multichill) >>! In T181906#3878562, @herron wrote: > `Jan 04 21:25:18 2018 (668) bad regexp in bounce_matching_header line: Moderators-nl` A... [18:14:05] !log otto@tin Finished deploy [analytics/superset/deploy@990bc38]: Running superset with python3 (fingers crossed) (duration: 02m 11s) [18:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:58] PROBLEM - superset on thorium is CRITICAL: connect to address 10.64.53.26 and port 9080: Connection refused [18:15:18] !log otto@tin Started deploy [analytics/superset/deploy@990bc38]: Running superset with python3 (fingers crossed) [18:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:35] !log otto@tin Finished deploy [analytics/superset/deploy@990bc38]: Running superset with python3 (fingers crossed) (duration: 00m 19s) [18:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:58] RECOVERY - superset on thorium is OK: TCP OK - 0.000 second response time on 10.64.53.26 port 9080 [18:30:09] (03CR) 10Cmjohnson: [C: 032] Adding production dns labvirt1021/22 [dns] - 10https://gerrit.wikimedia.org/r/402361 (https://phabricator.wikimedia.org/T183937) (owner: 10Cmjohnson) [18:40:16] (03CR) 10Gergő Tisza: "Yeah, please deploy it and I'll update the code if the extra star is really needed. 
(Reverting probably would not remove the cron entry, y" [puppet] - 10https://gerrit.wikimedia.org/r/395694 (https://phabricator.wikimedia.org/T181107) (owner: 10Gergő Tisza) [18:42:40] 10Operations, 10Cloud-VPS, 10cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3878805 (10chasemp) >>! In T183970#3876440, @Dfko wrote: > Hi, I am looking around for the offending files to delete them, but it has been a long while s... [18:48:48] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3878818 (10herron) >>! In T181906#3878780, @Multichill wrote: > Are these mails dropped or can we expect a bunch of mails coming in because one of the m... [18:52:36] 10Operations, 10Data-Services, 10Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#3878839 (10madhuvishy) [18:52:40] 10Operations, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Reimage labstore1001 and labstore1002 for DRBD storage setup - https://phabricator.wikimedia.org/T158196#3878835 (10madhuvishy) 05Open>03Resolved We'll do the upgrade to stretch for all labstore servers as a separate ste... [18:52:43] 10Operations, 10Cloud-VPS, 10monitoring: remove cloud VPS project 'ganglia' - https://phabricator.wikimedia.org/T183917#3878840 (10Dzahn) Thanks Andrew. So yea, if this project also doesn't exist from your point of view then this ticket can be closed. 
[19:08:11] (03PS1) 10Ottomata: Use python3 async gthread workers for superset [puppet] - 10https://gerrit.wikimedia.org/r/402411 (https://phabricator.wikimedia.org/T182688) [19:09:45] (03PS2) 10Ottomata: Use python3 async gthread workers for superset [puppet] - 10https://gerrit.wikimedia.org/r/402411 (https://phabricator.wikimedia.org/T182688) [19:09:49] (03CR) 10Ottomata: [V: 032 C: 032] Use python3 async gthread workers for superset [puppet] - 10https://gerrit.wikimedia.org/r/402411 (https://phabricator.wikimedia.org/T182688) (owner: 10Ottomata) [19:16:38] (03CR) 10BryanDavis: pcc: Python3 compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/402119 (owner: 10BryanDavis) [19:22:39] 10Operations, 10Cloud-VPS, 10cloud-services-team: wikidumpparse is using 1.2TB of 5T available NFS misc storage - https://phabricator.wikimedia.org/T183970#3878904 (10notconfusing) Yes, I will investigate and delete to under 100GB by Monday January 8th 2018. Thanks, Max Klein [19:31:34] (03PS1) 10Cmjohnson: Add new labvirts to netboot and dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/402417 (https://phabricator.wikimedia.org/T183937) [19:32:37] 10Operations, 10Cloud-VPS, 10Patch-For-Review: Ferm rules for labstore NFS hosts - https://phabricator.wikimedia.org/T165136#3878923 (10madhuvishy) Noting that I merged https://gerrit.wikimedia.org/r/353508 and applied profile::wmcs::nfs::ferm to the new dumps distribution servers labstore1006&7, and the fer... [19:42:55] 10Operations, 10ops-eqiad, 10Traffic: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3878953 (10RobH) p:05Triage>03Normal [19:43:50] 10Operations, 10ops-eqiad, 10Traffic: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3878970 (10RobH) a:03BBlack Assigning this to @bblack to advise on racking proposal & confirm where these should go. Please provide feedback and assign to @Cmjohnson for followup. 
[19:47:09] (03CR) 10RobH: [C: 031] Add new labvirts to netboot and dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/402417 (https://phabricator.wikimedia.org/T183937) (owner: 10Cmjohnson) [19:51:18] (03PS1) 10RobH: adding esanders to two groups [puppet] - 10https://gerrit.wikimedia.org/r/402420 [19:51:43] (03CR) 10jerkins-bot: [V: 04-1] adding esanders to two groups [puppet] - 10https://gerrit.wikimedia.org/r/402420 (owner: 10RobH) [19:52:05] (03PS1) 10Ottomata: Use shell username instead of ldap CN to authenticate with superset [puppet] - 10https://gerrit.wikimedia.org/r/402421 [19:52:08] (03PS2) 10RobH: adding esanders to two groups [puppet] - 10https://gerrit.wikimedia.org/r/402420 (https://phabricator.wikimedia.org/T184206) [19:52:55] (03CR) 10Ottomata: [C: 032] Use shell username instead of ldap CN to authenticate with superset [puppet] - 10https://gerrit.wikimedia.org/r/402421 (owner: 10Ottomata) [19:53:16] (03PS2) 10Cmjohnson: Add new labvirts to netboot and dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/402417 (https://phabricator.wikimedia.org/T183937) [19:53:26] 10Operations, 10Ops-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3878991 (10RobH) [19:53:49] 10Operations, 10ops-eqiad, 10Traffic: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3878995 (10RobH) Please note that the three business day wait for objections to be noted on the task will end on Tuesday, 2018-01-09. Barring objections, this can be merged by ops clinic duty on t... 
[19:54:01] (03CR) 10Cmjohnson: [C: 032] Add new labvirts to netboot and dhcpd [puppet] - 10https://gerrit.wikimedia.org/r/402417 (https://phabricator.wikimedia.org/T183937) (owner: 10Cmjohnson) [19:54:03] bleh [19:54:14] too many open tasks commenting on wrong tasks, yay for phab comment delete [19:54:35] 10Operations, 10Ops-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Ed Sanders - https://phabricator.wikimedia.org/T184206#3876092 (10RobH) p:05Triage>03Normal Please note that the three business day wait for objections to be noted on the task will end on Tuesday, 2018... [20:00:20] PROBLEM - Host wtp1016.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:00:20] PROBLEM - Host wtp1015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:00:20] PROBLEM - Host wtp1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [20:08:41] (03PS1) 10Ottomata: Fixes to better configure hadoop.proxyuser [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402424 [20:09:00] (03CR) 10jerkins-bot: [V: 04-1] Fixes to better configure hadoop.proxyuser [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402424 (owner: 10Ottomata) [20:09:51] (03CR) 10Legoktm: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [20:10:05] (03PS2) 10Ottomata: Fixes to better configure hadoop.proxyuser [puppet/cdh] - 10https://gerrit.wikimedia.org/r/402424 [20:12:40] RECOVERY - Host wtp1016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.87 ms [20:15:12] (03PS1) 10Ottomata: Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 [20:15:32] (03CR) 10jerkins-bot: [V: 04-1] Allow superset to submit jobs to Hadoop as logged in users [puppet] - 10https://gerrit.wikimedia.org/r/402425 (owner: 10Ottomata) [20:16:09] RECOVERY - Host wtp1015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.84 ms [20:29:58] (03PS5) 10Tjones: Updates to enable 
transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/396282 (https://phabricator.wikimedia.org/T23582) [20:33:14] (03PS8) 10Tjones: Updates to enable short URLs for transliteration for crhwiki - beta [puppet] - 10https://gerrit.wikimedia.org/r/396283 (https://phabricator.wikimedia.org/T23582) [20:38:54] (03PS1) 10ArielGlenn: add scap keys for dumpsdeploy for beta [labs/private] - 10https://gerrit.wikimedia.org/r/402426 [20:43:48] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879068 (10mmodell) [20:43:51] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-phab completely broken - https://phabricator.wikimedia.org/T184233#3879065 (10mmodell) 05Open>03Resolved a:03mmodell I deleted the instance [20:44:05] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3879069 (10Natuur12) Well, seems that this is resolved. (Though we need a new spamfilter.) Thank you so much for all the help Herron and Dzahn. [20:47:25] 10Operations, 10Wikimedia-Mailing-lists: Emails send by subscribers don't arrive on the mailing list Moderators-nl - https://phabricator.wikimedia.org/T181906#3879083 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:47:29] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879086 (10Krenair) Nope, it just plain doesn't exist... [20:47:50] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.12 [keeping static files] (duration: 02m 11s) [20:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] Can someone get me "apt-cache policy prometheus-nutcracker-exporter" from a prod stretch videoscaler host? 
[20:50:38] Krenair that package exists for me [20:50:47] https://phabricator.wikimedia.org/P6545 [20:51:02] interesting. on which host was this? [20:51:09] git.eqiad.wmflabs [20:51:13] jenkins-slave-01 [20:51:48] thank you [20:51:58] your welcome :). [20:52:59] hmmmm [20:53:41] there are a bunch of warnings and errors from 'apt-get update' on at least two beta hosts where this seems to be a problem. could be related [20:54:23] works fine on this one you've found [20:54:38] Krenair: https://phabricator.wikimedia.org/P6546 [20:54:56] but then you don't have a /etc/apt/sources.list.d/wikimedia-experimental.list which seems to be a problem due to conflicts [20:55:08] mutante, thanks [20:55:15] rm /etc/apt/sources.list.d/wikimedia-experimental.list ? [20:55:26] there is no /etc/apt/sources.list.d/wikimedia-experimental.list on it [20:56:03] wikimedia.list and debian-backports.list [20:56:11] interesting [20:56:16] from modules/apt/manifests/init.pp [20:56:21] if $::operatingsystem == 'Debian' { [20:56:22] apt::repository { 'wikimedia-experimental': [20:56:22] ensure => $use_experimental_ensure, [20:57:08] now in prod, use_experimental_ensure is going to be true for some (not all) caches but not mw servers [20:57:09] [mw1259:/etc/apt/sources.list.d] $ facter | grep operatingsystem [20:57:09] operatingsystem => Debian [20:57:43] did someone set apt::use_experimental [20:57:46] Krenair ^^ [20:57:58] that's what I'm looking at [20:59:03] yep there it is - https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep [20:59:12] global across the whole project, nice [21:00:12] maybe our apt system is not set up to have an experimental thing on stretch, and this breaks apt-get and co [21:00:19] heh, yea. too many places to set Hiera? 
git repo, wiki page, horizon, wikitech [21:00:28] yeah I know [21:00:34] also locally on the puppetmaster :p [21:00:43] Krenair there's a experimental repo for stretch https://apt.wikimedia.org/wikimedia/dists/stretch-wikimedia/experimental/ [21:00:48] but wiki page is on wikitech [21:00:55] also what was the exact error you got from apt-get update please? [21:01:34] there's some warnings, but the errors: [21:01:35] E: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/stretch-wikimedia/InRelease Unable to find expected entry 'experimental/source/Sources' in Release file (Wrong sources.list entry or malformed file) [21:01:35] E: Some index files failed to download. They have been ignored, or old ones used instead. [21:03:04] so maybe it's not fully set up or something [21:03:30] Krenair ah [21:03:36] comparing both from jessie and stretch [21:03:48] results in stretch not doing [21:03:49] Components: main backports thirdparty experimental thirdparty/cloudera component/ci thirdparty/ci component/elastic55 thirdparty/elastic55 component/icu57 component/git [21:03:52] but jessie does. [21:04:10] yeah jessie has a bunch of entries in the InRelease file for experimental, unlike stretch [21:04:44] I'm going to remove the experimental sources file from one of the broken instances, run apt, see if I get the package I need, then see if I can prevent puppet from adding this to stretch instances where it appears to break [21:05:05] so is this about puppet failure on deployment prep? [21:05:14] yeah [21:05:19] due to packages failing to install [21:05:24] i was about to say.. so what's the goal.. remove experimental? [21:05:35] probably due to this error breaking apt-get update [21:05:38] we want to be like prod, right [21:05:41] fix all the things [21:05:45] Or we could add the experimental component to stretch. 
[21:05:50] the core issue is probably that we need both staging and testing [21:06:03] as the folder is already there, it just needs adding to that InRelease file. [21:06:07] yeah there it is [21:06:16] the instance sees the package it needed now [21:06:55] Krenair: if in doubt i would vote for "deployment-prep should be like prod" and experimental should be tested in another place [21:07:05] but.. also what releeng says [21:07:43] https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=1572238&oldid=1368443 [21:10:28] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879147 (10Krenair) Dug into this a bit more with som... [21:10:43] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879148 (10Paladox) It's due to the experimental comp... 
[21:10:49] (03PS1) 10Chad: Releases: Include all contint PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/402430 [21:11:04] (03CR) 10ArielGlenn: [V: 032 C: 032] add scap keys for dumpsdeploy for beta [labs/private] - 10https://gerrit.wikimedia.org/r/402426 (owner: 10ArielGlenn) [21:11:28] our stretch instances: (9) deployment-imagescaler02.deployment-prep.eqiad.wmflabs,deployment-kafka-jumbo-[1-2].deployment-prep.eqiad.wmflabs,deployment-mediawiki07.deployment-prep.eqiad.wmflabs,deployment-netbox.deployment-prep.eqiad.wmflabs,deployment-redis[05-06].deployment-prep.eqiad.wmflabs,deployment-snapshot01.deployment-prep.eqiad.wmflabs,deployment-videoscaler01.deployment-prep.eqiad.wmflabs [21:11:35] indeed [21:11:54] * apergos is proud to join the stretch club [21:12:02] prod roles using experimental apt: [21:12:02] hieradata/role/common/cache/canary.yaml:apt::use_experimental: true [21:12:03] hieradata/role/common/cache/misc.yaml:apt::use_experimental: true [21:12:26] wonder if any of those are stretch [21:12:33] Krenair https://github.com/wikimedia/puppet/blob/production/modules/aptrepo/files/distributions-wikimedia [21:13:23] so experimental isn't even listed under stretch there [21:16:22] no, experimental on stretch was removed, moritz reminded me of this today [21:18:13] apergos, do you think we should add an extra check in the apt module to ignore apt::use_experimental on stretch machines? [21:18:35] https://wikitech.wikimedia.org/wiki/APT_repository [21:18:43] well this is the structure now [21:19:12] are there any stretch instances that were spun off after that change was made [21:19:29] which change? 
[21:19:44] the change to the repo structure and [21:19:55] one hopes, a corresponding change in puppet manifests [21:20:03] *spun up [21:20:58] I mean the instant fix is to remove /etc/apt/sources.list.d/wikimedia-experimental on the whining instances [21:21:12] but the question is whether that's added in puppet still and/or in the base image still [21:21:15] for new instances [21:22:51] 10Operations, 10Cloud-VPS, 10DNS, 10Traffic, 10Beta-Cluster-reproducible: Create some mechanism for instances in projects to modify the project Designate records - https://phabricator.wikimedia.org/T184245#3877216 (10bd808) Related: * https://github.com/hanazuki/acmesmith-designate * {T173469} [21:24:39] (03Draft1) 10Paladox: aptrepo: Add experimental to stretch (distributions-wikimedia) [puppet] - 10https://gerrit.wikimedia.org/r/402431 (https://phabricator.wikimedia.org/T184239) [21:24:42] (03PS2) 10Paladox: aptrepo: Add experimental to stretch (distributions-wikimedia) [puppet] - 10https://gerrit.wikimedia.org/r/402431 (https://phabricator.wikimedia.org/T184239) [21:25:08] apergos, it's added by puppet, yeah [21:25:17] we have use_experimental globally across the project [21:25:23] Krenair https://gerrit.wikimedia.org/r/402431 [21:25:29] prod only has it on two roles, which may not have any stretch machines [21:25:52] paladox, as it was removed deliberately I don't think we should do that [21:25:57] don't add it back please [21:26:00] Oh [21:26:08] (03Abandoned) 10Paladox: aptrepo: Add experimental to stretch (distributions-wikimedia) [puppet] - 10https://gerrit.wikimedia.org/r/402431 (https://phabricator.wikimedia.org/T184239) (owner: 10Paladox) [21:26:21] I think we should make use_experimental get ignored when the instance is running stretch [21:27:56] Krenair couldn't we use one of those var = var ? { '' => answer, default => } ? [21:28:21] yes the logic is entirely doable [21:28:53] I will write the paste. 
[21:29:09] it might end up being some && (dist == jessie || dist == trusty) check but the bigger question is whether it's the right solution [21:29:17] yeah I'm ok with that Krenair [21:29:34] ignore for stretch [21:29:58] in the end it's going to be ignore for stretch and later, I think [21:30:04] cool [21:30:28] that would be logical [21:36:02] Krenair apergos https://phabricator.wikimedia.org/P6547 [21:38:01] (03PS1) 10Gergő Tisza: Add DELETE to list of allowed methods for text varnish [puppet] - 10https://gerrit.wikimedia.org/r/402433 (https://phabricator.wikimedia.org/T182825) [21:38:07] (03Draft1) 10Paladox: apt: Do not use experimental on stretch [puppet] - 10https://gerrit.wikimedia.org/r/402432 [21:38:09] (03PS2) 10Paladox: apt: Do not use experimental on stretch [puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) [21:38:12] Krenair ^^ [21:38:16] I was thinking more along the lines of if os_version('debian <= jessie') { instead of the current $::operatingsystem == 'Debian' check [21:38:43] I see. 
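[editor's sketch] The guard suggested above, replacing the plain `$::operatingsystem == 'Debian'` test with an `os_version` check, could look roughly like this. It is a sketch of the idea only, not the text of change 402432. Only the resource title and the `ensure` line appear in the log; the remaining parameters of the existing resource were not quoted and are deliberately left out here, so this fragment is not complete on its own.

```puppet
# Hypothetical sketch: only declare the experimental repository on
# Debian releases that still ship an "experimental" component.
if os_version('debian <= jessie') {
    apt::repository { 'wikimedia-experimental':
        ensure => $use_experimental_ensure,
        # ...remaining parameters as in the existing resource
        # (not shown in the log, so not reproduced here)
    }
}
```

On stretch and later the condition is false, so `apt::use_experimental` would be silently ignored there, much as it already is on trusty according to the discussion.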
[21:39:25] it already silently ignores the experimental thing if you're on trusty, so having it ignore on debian stretch (where it doesn't exist) seems reasonable [21:39:36] (03PS3) 10Paladox: apt: Do not use experimental on stretch [puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) [21:39:37] done [21:39:37] (03PS4) 10Madhuvishy: wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) (owner: 10BryanDavis) [21:39:44] could make it os_version('debian jessie') [21:42:28] (03CR) 10Madhuvishy: [C: 032] wmcs: Add database drop support to maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/402137 (https://phabricator.wikimedia.org/T181925) (owner: 10BryanDavis) [21:42:43] Krenair as a temp workaround you could use horizon to prefix apply the hiera var globally for certain hosts in deployment prep [21:43:04] I could but it'd be bad [21:46:44] eww no [21:47:12] yeah <= jessie seems good to me [21:52:08] (03CR) 10ArielGlenn: [C: 031] "Since this is the new repo structure going forward, makes sense to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: 10Paladox) [21:53:01] paladox, actually, come to think of it, I don't know if I have sufficient permissions in deployment-prep to do that there anymore, if I wanted to [21:54:19] (03CR) 10Alex Monk: [C: 031] apt: Do not use experimental on stretch [puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: 10Paladox) [21:57:26] yep [22:08:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [22:18:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [22:21:24] ^^ the Zuul alarm was definitely transient. A spike of chained changes got sent to Gerrit [22:21:31] nothing to worry about [22:22:06] (03CR) 10Dzahn: [C: 031] "from a mail by Moritz: "Experimental has been removed from stretch-wikimedia already and for" [puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: 10Paladox) [22:23:32] hashar: :)thx [22:23:51] (03CR) 10Alex Monk: [C: 031] "Cherry-picked the patch on deployment-puppetmaster02, ran puppet on affected instances, ran apt-get update on affected instances, ran pupp" [puppet] - 10https://gerrit.wikimedia.org/r/402432 (https://phabricator.wikimedia.org/T184239) (owner: 10Paladox) [22:27:31] hm, from the cumin docs [22:27:33] F:lsbdistid = Debian and analytics* selects all the hosts with hostname that starts with analytics that have Ubuntu as OS. [22:27:42] saying = Debian means select Ubuntu? 
[22:27:58] !log T184263 ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=eswiki --logwiki=metawiki "Mega849" "Mega809" [22:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:10] T184263: Global rename failure on account Mega809 - https://phabricator.wikimedia.org/T184263 [22:28:29] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879304 (10Krenair) Patch handl... [22:29:27] (03PS4) 10Dzahn: network::constants: add fake CACHE_MISC for labs [puppet] - 10https://gerrit.wikimedia.org/r/402136 [22:30:04] ^ this will fix puppet run for all roles setting up something on apache behind misc-web varnish [22:30:09] (in labs) [22:31:26] (03CR) 10Dzahn: [C: 032] network::constants: add fake CACHE_MISC for labs [puppet] - 10https://gerrit.wikimedia.org/r/402136 (owner: 10Dzahn) [22:35:06] (03PS1) 10Ladsgroup: Add test2wiki as a group1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402445 [22:36:22] (03PS2) 10Ladsgroup: Add test2wiki as a group1 wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/402445 (https://phabricator.wikimedia.org/T182326) [22:39:06] (03CR) 10Dzahn: "we now have a brandnew module called "httpd" which replaces the apache module used here and which lets us fix the -1 from jenkins-bot" [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) (owner: 10Dzahn) [22:39:58] 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mediawiki07, deployment-imagescaler02, deployment-redis06, deployment-videoscaler01 due to prometheus exporter packages being missing in stretch - https://phabricator.wikimedia.org/T184239#3879347 (10Krenair) Actually th... 
[22:40:18] did you guys see this btw? https://phabricator.wikimedia.org/T153468 [22:43:24] had not seen that yet, heh [22:48:08] (03PS3) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [22:51:07] 10Operations, 10DNS, 10Traffic, 10Beta-Cluster-reproducible, 10Upstream: Ferm/DNS library weirdness on deployment-mediawiki boxes - https://phabricator.wikimedia.org/T153468#3879374 (10Krenair) Gave up waiting for that (it's been almost a year), sent a message anyway and it's been held for moderation. [23:03:31] 10Operations, 10DNS, 10Traffic, 10Beta-Cluster-reproducible, 10Upstream: Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances - https://phabricator.wikimedia.org/T153468#3879417 (10Krenair) [23:04:27] (03PS1) 10ArielGlenn: add explicit name of dumps key id for scap [dumps/scap] - 10https://gerrit.wikimedia.org/r/402447 [23:05:20] (03CR) 10Thcipriani: [C: 031] add explicit name of dumps key id for scap [dumps/scap] - 10https://gerrit.wikimedia.org/r/402447 (owner: 10ArielGlenn) [23:06:21] (03CR) 10ArielGlenn: [V: 032 C: 032] add explicit name of dumps key id for scap [dumps/scap] - 10https://gerrit.wikimedia.org/r/402447 (owner: 10ArielGlenn) [23:07:44] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3879421 (10Dzahn) [23:09:22] (03PS4) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [23:12:40] (03PS5) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [23:22:00] mutante: Mind having a lookie at https://gerrit.wikimedia.org/r/402430? 
[23:22:02] 10Operations, 10DNS, 10Traffic, 10Beta-Cluster-reproducible, 10Upstream: Ferm/DNS library weirdness causing puppet errors on 12 deployment-prep instances - https://phabricator.wikimedia.org/T153468#3879428 (10Krenair) [23:22:05] Should be pretty easy :) [23:22:05] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879427 (10Krenair) [23:23:08] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192864 (10demon) Is this really best as a tracking task or should we add it to the deployment-prep workboard column? The task by its nature is always gonna be... [23:23:09] no_justification: oh yea, i saw it earlier, was on phone.. yes [23:23:18] No worries ty <3 [23:23:41] (03PS2) 10Dzahn: Releases: Include all contint PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/402430 (owner: 10Chad) [23:24:34] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3879432 (10Krenair) It's fine with me if you want to move them all to a particular workboard column instead of a tracking task [23:24:41] (03CR) 10Dzahn: [C: 032] Releases: Include all contint PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/402430 (owner: 10Chad) [23:26:51] (03CR) 10Dzahn: "Notice: /Stage[main]/Contint::Packages::Php/Package[php7.0-gmp]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/402430 (owner: 10Chad) [23:27:00] no_justification: see paste ^ [23:28:02] Coolio, no errors [23:28:27] (03CR) 10Dzahn: [C: 04-1] "closer but not yet: http://puppet-compiler.wmflabs.org/9599/bromine.eqiad.wmnet/change.bromine.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) (owner: 10Dzahn) [23:28:41] no_justification: yep, no errors and 
also done on 2001 [23:31:02] 10Operations, 10DNS, 10Traffic, 10Beta-Cluster-reproducible, 10Upstream: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468#3879457 (10Krenair) [23:33:54] (03PS6) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [23:38:19] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [23:42:38] (03PS7) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [23:42:39] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3879497 (10RobH) [23:43:08] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3794643 (10RobH) I've tested everything and I'm not getting serial output on the port5 for the mr1, and port 6 for the atlas. I'm having them check the port5 first, since it's a critical item. 
[23:44:24] welcome back stashbot [23:46:01] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/9601/bromine.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) (owner: 10Dzahn) [23:46:08] (03PS8) 10Dzahn: microsites: create research.wikimedia.org static page [puppet] - 10https://gerrit.wikimedia.org/r/401597 (https://phabricator.wikimedia.org/T183916) [23:48:19] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10fullscreenorgId=1 [23:55:34] (03Abandoned) 10Dzahn: planet: move locales include out of module [puppet] - 10https://gerrit.wikimedia.org/r/402161 (owner: 10Dzahn)