[00:01:21] (PS3) CRusnov: Add ganeti->netbox sync script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229)
[00:07:10] (PS1) CRusnov: Update to upstream v2.5.7 tag. [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492577
[00:33:39] Operations, Operations-Software-Development: Netbox: cable termination names report - https://phabricator.wikimedia.org/T216469 (crusnov) Just a note the models are a bit confused here. The termination is identified in the cable, i'm not sure that there are ports without cables or if that is important, b...
[01:14:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 30.77% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:21:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:31:06] PROBLEM - Disk space on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 473469 MB (5% inode=79%)
[05:26:07] Operations, Research-Programs, SRE-Access-Requests, Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (RyanSteinberg) Resolved→Open I'm reopening this so someone can take a look at @toddleroux's...
[05:38:32] (CR) KartikMistry: WIP: Cron to run script to purge old CX drafts (4 comments) [puppet] - https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: KartikMistry)
[05:38:50] (PS6) KartikMistry: WIP: Cron to run script to purge old CX drafts [puppet] - https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091)
[05:39:26] (CR) jerkins-bot: [V: -1] WIP: Cron to run script to purge old CX drafts [puppet] - https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: KartikMistry)
[05:58:06] (PS1) Marostegui: db-eqiad: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492596
[05:59:21] (PS2) Marostegui: db-eqiad: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492596
[06:00:29] (CR) Marostegui: [C: +2] db-eqiad: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492596 (owner: Marostegui)
[06:01:35] (Merged) jenkins-bot: db-eqiad: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492596 (owner: Marostegui)
[06:02:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1104 for MySQL upgrade (duration: 00m 50s)
[06:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:51] !log Stop MySQL on db1104 for mysql upgrade
[06:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:33] (CR) jenkins-bot: db-eqiad: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492596 (owner: Marostegui)
[06:08:27] (PS1) Marostegui: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492597
[06:11:03] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492597 (owner: Marostegui)
[06:11:59] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492597 (owner: Marostegui)
[06:13:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 after MySQL upgrade (duration: 00m 45s)
[06:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:56] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492597 (owner: Marostegui)
[06:20:08] (PS1) Marostegui: db-eqiad.php: Repool db1104 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492598
[06:25:36] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1104 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492598 (owner: Marostegui)
[06:27:05] (Merged) jenkins-bot: db-eqiad.php: Repool db1104 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492598 (owner: Marostegui)
[06:28:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1104 in API after MySQL upgrade (duration: 00m 45s)
[06:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:56] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.167 second response time
[06:30:09] (CR) jenkins-bot: db-eqiad.php: Repool db1104 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492598 (owner: Marostegui)
[06:30:36] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:35:50] (PS1) Marostegui: parsercachepurging.pp: Increase TTL from 24 days to 30 [puppet] - https://gerrit.wikimedia.org/r/492599 (https://phabricator.wikimedia.org/T210992)
[06:37:10] (PS1) Marostegui: InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992)
[06:37:42] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.639 second response time
[06:38:06] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[07:05:39] (PS1) Marostegui: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492601
[07:07:04] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492601 (owner: Marostegui)
[07:08:04] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492601 (owner: Marostegui)
[07:09:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1104 after MySQL upgrade (duration: 00m 45s)
[07:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:42] (CR) jenkins-bot: db-eqiad.php: Fully repool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/492601 (owner: Marostegui)
[07:15:47] Operations, Analytics: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (elukey) p:Triage→Normal
[07:16:02] Operations, Analytics: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (elukey)
[07:16:15] Operations, Analytics, User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (elukey)
[07:19:24] (CR) Marostegui: mariadb: Add the option of postprocessing backups (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[07:30:07] Operations, Analytics, User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (elukey) List of parent znodes in main-eqiad: ` [zk: localhost:2181(CONNECTED) 0] ls / [registry, brokers, zookeeper, yarn-leader-election, hadoop-ha, r...
[07:30:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:31:44] Operations, Analytics, User-Elukey: Review znodes on Zookeeper cluster to possibly remove not-used data - https://phabricator.wikimedia.org/T216979 (elukey) * Main codfw is less crowded and probably doesn't need a clean up: ` [zk: localhost:2181(CONNECTED) 0] ls / [burrow, kafka, zookeeper] [zk: lo...
[07:35:24] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:35:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:35:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:37:20] seems to be a spike for graphite-labs
[07:37:43] should recover soon
[07:37:49] (I mean the alert)
[07:40:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:40:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:46:52] (PS2) Muehlenhoff: Explicitly install ruby-safe-yaml to fix Puppet Icinga check [puppet] - https://gerrit.wikimedia.org/r/492346 (https://phabricator.wikimedia.org/T213546)
[07:48:07] (CR) Muehlenhoff: [C: +2] Explicitly install ruby-safe-yaml to fix Puppet Icinga check [puppet] - https://gerrit.wikimedia.org/r/492346 (https://phabricator.wikimedia.org/T213546) (owner: Muehlenhoff)
[08:03:25] Operations, ops-eqiad, serviceops, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (MoritzMuehlenhoff) a:Cmjohnson
[08:04:21] Operations, ops-eqiad, serviceops, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (MoritzMuehlenhoff) This server is still under warranty for another 6-7 weeks.
[08:04:44] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group Muehlenhoff T211668
[08:24:49] Operations, Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (MoritzMuehlenhoff) @Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485968/ removed the spare role from mw2151,but the host is still installed with role(spare) and puppet is failing:...
[08:43:02] (CR) Vgutierrez: [C: +2] gerrit: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492283 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[08:43:10] (PS2) Vgutierrez: gerrit: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492283 (https://phabricator.wikimedia.org/T207389)
[08:44:10] Operations, monitoring: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (fgiunchedi)
[08:48:24] Operations, ops-eqiad: Degraded RAID on ms-be1020 - https://phabricator.wikimedia.org/T214778 (fgiunchedi) @Cmjohnson sounds good, let me know when you are ready to go and I'll poweroff the host.
[08:49:09] <_joe_> !log generating mcrouter certificate for mw2151 T192457
[08:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:12] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457
[08:52:03] !log Deploy schema change on s2 on codfw master - lag will happen on s2 codfw - T187295
[08:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:07] T187295: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295
[08:56:07] Operations, Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (Joe) >>! In T192457#4979764, @MoritzMuehlenhoff wrote: > @Dzahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485968/ removed the spare role from mw2151,but the host is still installed with...
[08:58:07] (CR) Jcrespo: [C: +1] parsercachepurging.pp: Increase TTL from 24 days to 30 [puppet] - https://gerrit.wikimedia.org/r/492599 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[08:58:30] (CR) Jcrespo: [C: +1] InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[09:00:29] (CR) Jcrespo: ">" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:01:02] (CR) Filippo Giunchedi: [C: -1] "See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite)
[09:01:52] (CR) Jcrespo: "> >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:02:15] (CR) Marostegui: "> >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:02:40] (PS2) Marostegui: parsercachepurging.pp: Increase TTL from 24 days to 30 [puppet] - https://gerrit.wikimedia.org/r/492599 (https://phabricator.wikimedia.org/T210992)
[09:02:52] (PS2) Marostegui: InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992)
[09:03:19] (CR) Jcrespo: "> > >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:04:35] (CR) Marostegui: [C: +2] parsercachepurging.pp: Increase TTL from 24 days to 30 [puppet] - https://gerrit.wikimedia.org/r/492599 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[09:04:54] (CR) Marostegui: [C: +2] InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[09:05:52] (Merged) jenkins-bot: InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[09:07:26] !log marostegui@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Increase ParserCache TTL from 24 days to 30 - T210992 (duration: 00m 46s)
[09:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:29] T210992: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992
[09:09:45] (CR) Marostegui: "> > > >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:10:11] (CR) jenkins-bot: InitialiseSettings.php: Restore TTL back to 30 days [mediawiki-config] - https://gerrit.wikimedia.org/r/492600 (https://phabricator.wikimedia.org/T210992) (owner: Marostegui)
[09:11:09] (PS1) Muehlenhoff: Record extended MOU date for flemmerich [puppet] - https://gerrit.wikimedia.org/r/492609
[09:13:52] (CR) Muehlenhoff: [C: +2] Record extended MOU date for flemmerich [puppet] - https://gerrit.wikimedia.org/r/492609 (owner: Muehlenhoff)
[09:15:46] PROBLEM - Check systemd state on mw2151 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:16:03] Operations, DBA, Patch-For-Review, Performance-Team (Radar): Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) I have increased the TTL back to 30 days. Going to monitor the graphs for a few days before closing this.
[09:17:29] (PS1) Muehlenhoff: Partly remove access for awight [puppet] - https://gerrit.wikimedia.org/r/492611
[09:18:21] (CR) Jcrespo: "> > > > >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:19:12] PROBLEM - nutcracker port on mw2151 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[09:19:28] (CR) Muehlenhoff: [C: +2] Partly remove access for awight [puppet] - https://gerrit.wikimedia.org/r/492611 (owner: Muehlenhoff)
[09:19:52] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap]
[09:20:56] PROBLEM - nutcracker process on mw2151 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker
[09:22:48] Operations, Thumbor, serviceops: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles)
[09:23:03] (CR) Marostegui: [C: +1] "> > > > > >" (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[09:25:06] (CR) Vgutierrez: [C: +2] gerrit: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[09:25:13] (PS3) Vgutierrez: gerrit: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492284 (https://phabricator.wikimedia.org/T207389)
[09:34:58] PROBLEM - PHP7 rendering on mw2151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - header X-Powered-By: PHP/7. not found on http://10.192.32.39:9005/w/health-check.php - 380 bytes in 0.001 second response time
[09:35:10] <_joe_> uhm interesting
[09:35:16] (CR) Vgutierrez: [C: +2] icinga: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492285 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[09:35:29] (PS2) Vgutierrez: icinga: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492285 (https://phabricator.wikimedia.org/T207389)
[09:36:30] <_joe_> moritzm: ^^ installation of a new server failed, because of T216712
[09:36:31] T216712: Switch PHP 7.2 packages to an internal component - https://phabricator.wikimedia.org/T216712
[09:37:23] (PS2) Muehlenhoff: postgresql: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/489631
[09:40:26] PROBLEM - mediawiki-installation DSH group on mw2151 is CRITICAL: Host mw2151 is not in mediawiki-installation dsh group
[09:41:32] (CR) Vgutierrez: [C: +2] icinga: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492286 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[09:41:39] (PS2) Vgutierrez: icinga: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492286 (https://phabricator.wikimedia.org/T207389)
[09:42:17] (CR) Muehlenhoff: [C: +2] postgresql: Remove support for trusty [puppet] - https://gerrit.wikimedia.org/r/489631 (owner: Muehlenhoff)
[09:43:36] (PS3) Vgutierrez: icinga: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492286 (https://phabricator.wikimedia.org/T207389)
[09:44:18] Operations, Discovery-Search, Wikimedia-Logstash: Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (Mathew.onipe)
[09:44:36] Operations, Discovery-Search, Wikimedia-Logstash: Upgrade logstash plugins to 5.6.14 - https://phabricator.wikimedia.org/T216993 (Mathew.onipe) p:Triage→High
[09:45:11] <_joe_> I'll ack mw2151 alerts once I'm done with it, please be patient :)
[09:45:50] RECOVERY - PHP7 rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 322 bytes in 0.006 second response time
[09:47:13] yeah, should be failing because of the pcre package conflict
[09:50:40] (CR) Filippo Giunchedi: "LGTM overall" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/492390 (owner: Herron)
[09:54:18] (CR) Filippo Giunchedi: [C: -1] "AFAICT this class is used by shinken too, which is still jessie (shinken-02.eqiad.wmflabs, +Giovanni)" [puppet] - https://gerrit.wikimedia.org/r/491460 (owner: Muehlenhoff)
[09:54:49] !log Deploy schema change on db1074, this will generate lag on labsdb:s2 - T187295
[09:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:52] T187295: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295
[09:55:13] (CR) Filippo Giunchedi: [C: +1] "LGTM (+Daniel)" [puppet] - https://gerrit.wikimedia.org/r/491780 (owner: CDanis)
[10:01:30] (CR) Vgutierrez: [C: -1] "please check the comments inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) (owner: Mathew.onipe)
[10:01:36] (CR) Marostegui: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[10:03:50] (PS1) Filippo Giunchedi: [WIP] Use mmjsonparse by default [puppet] - https://gerrit.wikimedia.org/r/492632 (https://phabricator.wikimedia.org/T213189)
[10:07:50] (PS1) Filippo Giunchedi: gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899)
[10:08:38] (CR) jerkins-bot: [V: -1] gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899) (owner: Filippo Giunchedi)
[10:10:49] (PS2) Filippo Giunchedi: gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899)
[10:11:47] (CR) jerkins-bot: [V: -1] gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899) (owner: Filippo Giunchedi)
[10:13:47] (PS1) Jbond: pin pdns-recursor to openstack-mitaka-jessie on jessie instead of jessie-backports [puppet] - https://gerrit.wikimedia.org/r/492635
[10:14:20] (CR) jerkins-bot: [V: -1] pin pdns-recursor to openstack-mitaka-jessie on jessie instead of jessie-backports [puppet] - https://gerrit.wikimedia.org/r/492635 (owner: Jbond)
[10:16:19] !log Depooling thumbor1001 to reimage - T214597
[10:16:22] (CR) Filippo Giunchedi: "jenkins fails because of wmf-style: 10:11:44 modules/profile/manifests/gerrit/server.pp:12 wmf-style: Parameter 'log_host' of class 'profi" [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899) (owner: Filippo Giunchedi)
[10:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:23] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597
[10:19:17] (PS2) Jbond: update pdns-recursor pin on jessie [puppet] - https://gerrit.wikimedia.org/r/492635
[10:20:07] (CR) DCausse: cloudelastic: Add cloudelastic configs (5 comments) [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[10:20:56] (CR) Vgutierrez: [C: +2] librenms: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492288 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:21:04] (PS2) Vgutierrez: librenms: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492288 (https://phabricator.wikimedia.org/T207389)
[10:23:27] (PS3) Filippo Giunchedi: gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899)
[10:24:01] (CR) jerkins-bot: [V: -1] gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899) (owner: Filippo Giunchedi)
[10:26:51] (CR) Vgutierrez: [C: +2] librenms: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492289 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:26:58] (PS2) Vgutierrez: librenms: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492289 (https://phabricator.wikimedia.org/T207389)
[10:27:24] (PS4) Filippo Giunchedi: gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899)
[10:28:40] RECOVERY - Disk space on labstore1004 is OK: DISK OK
[10:30:04] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T1030).
[10:31:46] !log labstore1004 restarted nfsd and killed stuck rpc.mountd.real processed (T216988)
[10:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:17] T216988: labstore1004 - DISK CRITICAL - free space: /srv/tools 115904 MB (1% inode=79%): - https://phabricator.wikimedia.org/T216988
[10:32:18] PROBLEM - toolschecker: toolsdb on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/toolsdb - 356 bytes in 60.076 second response time
[10:32:23] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/492636 (https://phabricator.wikimedia.org/T128546)
[10:32:34] RECOVERY - toolschecker: toolsdb on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.103 second response time
[10:32:59] Operations: Collate jessie-wikimedia/backports into jessie-wikimedia/main - https://phabricator.wikimedia.org/T167292 (MoritzMuehlenhoff) Open→Declined a:MoritzMuehlenhoff→None This doesn't seem useful at this point, it will simply vanish along with jessie installations being phased out. The...
[10:33:03] Operations, Patch-For-Review: Restructure our internal repositories further - https://phabricator.wikimedia.org/T158583 (MoritzMuehlenhoff)
[10:33:49] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/492636 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:34:14] (CR) Muehlenhoff: [C: +1] update pdns-recursor pin on jessie [puppet] - https://gerrit.wikimedia.org/r/492635 (owner: Jbond)
[10:34:47] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/492636 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:35:38] Operations, Traffic, HTTPS: Make sure that services available for NDA-only users are using strong TLS ciphersuites - https://phabricator.wikimedia.org/T217002 (Vgutierrez)
[10:35:50] Operations, Traffic, HTTPS: Make sure that services available for NDA-only users are using strong TLS ciphersuites - https://phabricator.wikimedia.org/T217002 (Vgutierrez) p:Triage→Normal
[10:37:48] (CR) Vgutierrez: [C: +2] lists: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492292 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:37:57] (PS3) Vgutierrez: lists: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492292 (https://phabricator.wikimedia.org/T207389)
[10:39:14] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:492636| Bumping portals to master (T128546, T202497)]] (duration: 00m 46s)
[10:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:19] T202497: Add fundraising appeal on Wikipedia portal page - https://phabricator.wikimedia.org/T202497
[10:39:20] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:40:01] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:492636| Bumping portals to master (T128546, T202497)]] (duration: 00m 46s)
[10:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:15] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/492636 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:43:26] (CR) Vgutierrez: [C: +2] lists: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492293 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:43:37] (PS3) Vgutierrez: lists: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492293 (https://phabricator.wikimedia.org/T207389)
[10:49:15] (CR) Vgutierrez: [C: +2] mirrors: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492295 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:49:23] (PS3) Vgutierrez: mirrors: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492295 (https://phabricator.wikimedia.org/T207389)
[10:52:40] (PS1) Filippo Giunchedi: logstash: remove log4j input [puppet] - https://gerrit.wikimedia.org/r/492640 (https://phabricator.wikimedia.org/T213899)
[10:53:32] (CR) Vgutierrez: [C: +2] mirrors: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492296 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[10:53:40] (PS3) Vgutierrez: mirrors: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492296 (https://phabricator.wikimedia.org/T207389)
[10:54:16] godog: hi, you probaly want to wait for that to be removed gerrit side first.
[10:54:22] *probably
[10:54:55] I have this patch https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/490797/ to remove it
[10:55:35] paladox: oh nice! I didn't know about that patch so I published https://gerrit.wikimedia.org/r/c/operations/puppet/+/492633 today
[10:55:47] which is the same afaics
[10:55:53] Ah ok
[10:55:55] Heh
[10:56:15] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492641
[10:57:05] (PS1) Effie Mouzeli: Upgrade all Thumbor servers to stretch [puppet] - https://gerrit.wikimedia.org/r/492642 (https://phabricator.wikimedia.org/T214597)
[11:07:33] (PS6) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[11:09:35] (CR) Effie Mouzeli: [C: +2] Upgrade all Thumbor servers to stretch [puppet] - https://gerrit.wikimedia.org/r/492642 (https://phabricator.wikimedia.org/T214597) (owner: Effie Mouzeli)
[11:10:15] (PS2) Effie Mouzeli: Upgrade all Thumbor servers to stretch [puppet] - https://gerrit.wikimedia.org/r/492642 (https://phabricator.wikimedia.org/T214597)
[11:10:17] (PS7) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[11:11:49] (PS8) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[11:13:10] (PS9) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[11:13:38] Operations, monitoring: google safe browsing icinga checks sporadic UNKNOWN due to 403 - https://phabricator.wikimedia.org/T216985 (Volans) There could be some throttling ongoing. Also from a very quick look at [1] we might be using an older API version/url... [1] https://developers.google.com/safe-bro...
[11:19:02] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` thumbor1001.eqiad.wmnet ` The log can be found in...
[11:19:38] !log Reimageing thumbor1001 - T214597
[11:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:43] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597
[11:20:50] (CR) Volans: [C: -1] "The operations/software/netbox repository master has not been updated, although the tags are there, it has diverged from upstream because " [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492577 (owner: CRusnov)
[11:21:27] (PS1) Giuseppe Lavagetto: Improve logging of errors, remove spurious print statements [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492647
[11:21:29] (PS1) Giuseppe Lavagetto: Do not try to pull/push if no registry is defined. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492648
[11:21:38] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[11:26:10] (CR) Mathew.onipe: cloudelastic: Add cloudelastic configs (5 comments) [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[11:27:09] (PS11) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[11:28:30] (CR) Volans: "> Patch Set 2:" [software/spicerack] - https://gerrit.wikimedia.org/r/492375 (owner: Gehel)
[11:35:27] (CR) Thiemo Kreuz (WMDE): [C: +1] "I would +2 this, but don't have the right to do so in this repository."
[dumps/dcat] - https://gerrit.wikimedia.org/r/484011 (owner: Hoo man)
[11:39:11] (PS10) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[11:42:46] (CR) Vgutierrez: [C: +2] exim: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[11:43:04] (PS3) Vgutierrez: exim: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492310 (https://phabricator.wikimedia.org/T207389)
[11:45:04] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[11:46:46] (CR) Vgutierrez: [C: +2] mail: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[11:46:53] (PS3) Vgutierrez: mail: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492311 (https://phabricator.wikimedia.org/T207389)
[11:49:37] !log rolling out intel-microcode 3.20180807a.2 on all jessie/stretch servers, tests on a number of previously unsupported servers with Westmere CPU were successful and I've verified that all other microcode files are identical compared to the current 3.20180807a.1 microcode
[11:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:31] ACKNOWLEDGEMENT - Check systemd state on mw2151 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto Server just reimaging, depends on a few tickets to be fixed.
[11:50:31] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2151 is CRITICAL: Host mw2151 is not in mediawiki-installation dsh group Giuseppe Lavagetto Server just reimaging, depends on a few tickets to be fixed.
[11:50:31] ACKNOWLEDGEMENT - nutcracker port on mw2151 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused Giuseppe Lavagetto Server just reimaging, depends on a few tickets to be fixed.
[11:50:31] ACKNOWLEDGEMENT - nutcracker process on mw2151 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker Giuseppe Lavagetto Server just reimaging, depends on a few tickets to be fixed.
[11:50:31] ACKNOWLEDGEMENT - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 25 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[scap],Package[libpcre3-dbg] Giuseppe Lavagetto Server just reimaging, depends on a few tickets to be fixed.
[11:51:15] (CR) Vgutierrez: [C: +2] netbox: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492317 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[11:51:28] (PS2) Vgutierrez: netbox: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492317 (https://phabricator.wikimedia.org/T207389)
[11:55:32] (CR) Vgutierrez: [C: +2] netbox: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492318 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[11:55:42] (PS2) Vgutierrez: netbox: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492318 (https://phabricator.wikimedia.org/T207389)
[12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T1200).
[12:00:04] Zoranzoki21: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process.
Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:21] I can SWAT today
[12:00:27] Zoranzoki21: around for swat?
[12:01:02] (PS2) Zfilipin: Added and subdomains of mehrnews.com to wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: Zoranzoki21)
[12:01:38] (CR) Zfilipin: "Scheduled for EU SWAT but the developer was not around so it was not deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961) (owner: Zoranzoki21)
[12:01:55] (CR) Vgutierrez: [C: +2] tendril: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492326 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[12:02:07] (PS2) Vgutierrez: tendril: Switch from certcentral to acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/492326 (https://phabricator.wikimedia.org/T207389)
[12:06:13] (CR) Vgutierrez: [C: +2] tendril: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492327 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[12:06:21] (PS2) Vgutierrez: tendril: Get rid of certcentral certificates [puppet] - https://gerrit.wikimedia.org/r/492327 (https://phabricator.wikimedia.org/T207389)
[12:07:55] (PS11) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[12:12:12] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[12:13:46] (PS1) Vgutierrez: icinga: Make use of the strong TLS ciphersuites configuration [puppet] - https://gerrit.wikimedia.org/r/492656 (https://phabricator.wikimedia.org/T217002)
[12:16:13] (CR) Vgutierrez: [C: +1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/14817/"
[puppet] - https://gerrit.wikimedia.org/r/492656 (https://phabricator.wikimedia.org/T217002) (owner: Vgutierrez)
[12:21:26] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[12:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:41] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 01m 15s)
[12:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:36] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[12:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:41] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 05s)
[12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:41] (PS12) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[12:34:35] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[12:36:26] Operations, serviceops, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions - https://phabricator.wikimedia.org/T217020 (elukey) p:Triage→Normal
[12:36:31] Operations, serviceops, Performance-Team (Radar), User-Elukey, User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions) - https://phabricator.wikimedia.org/T217020 (elukey)
[12:38:27] !log gilles@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[12:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:34] !log gilles@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided)
(duration: 00m 07s)
[12:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:55] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[12:40:39] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[12:40:39] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[12:40:47] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[intel-microcode]
[12:50:05] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:50:47] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[12:50:49] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:50:51] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[13:05:37] Operations, Proton, Reading-Infrastructure-Team-Backlog, Core Platform Team Kanban (Doing), Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (Jhernandez)
[13:08:05] PROBLEM - puppet last run on cloudvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[13:08:43] PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[initramfs-tools]
[13:09:45] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[13:10:41] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[13:13:17] RECOVERY - puppet last run on cloudvirt1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:13:57] RECOVERY - puppet last run on ms-fe1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:14:55] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:14:57] <_joe_> moritzm: is that expected?
[13:15:12] <_joe_> ^^ puppet failing trying to install initramfs-tools?
[13:17:46] (CR) DCausse: [C: +1] elasticsearch: upgrade rows one after the other [software/spicerack] - https://gerrit.wikimedia.org/r/492308 (owner: Gehel)
[13:20:45] yeah, those are all puppet spam caused by the intel-microcode update (which rebuilds the initrd)
[13:23:07] Operations, Proton, Reading-Infrastructure-Team-Backlog, Core Platform Team Kanban (Doing), Services (doing): Requests to MW 404 when on HTTPS - https://phabricator.wikimedia.org/T202982 (Jhernandez)
[13:26:19] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:32:54] !log upgrade etherpad-lite to 1.7.5
[13:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:51] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[tzdata]
[13:39:50] Operations, MediaWiki-Containers, Release-Engineering-Team, Core Platform Team Kanban (Done with CPT), and 4 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456 (CCicalese_WMF)
[13:44:45] (PS1) Giuseppe Lavagetto: Add golang image [docker-images/production-images] - https://gerrit.wikimedia.org/r/492667
[13:45:40] (PS3) WMDE-Fisch: Show referencePreviews on group0 wikis as beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905)
[13:49:42] (CR) Thiemo Kreuz (WMDE): [C: +1] "Good to go when the dependency is merged." (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/491959 (https://phabricator.wikimedia.org/T214905) (owner: WMDE-Fisch)
[13:53:10] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492641 (owner: Marostegui)
[13:54:00] Operations, Proton, Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (Jhernandez)
[13:54:12] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492641 (owner: Marostegui)
[13:55:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1085 for MySQL upgrade and schema change (duration: 00m 46s)
[13:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:43] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492641 (owner: Marostegui)
[13:56:55] (PS1) Volans: doc: fix reStructuredText formatting [software/spicerack] - https://gerrit.wikimedia.org/r/492671
[14:03:34] (CR) Volans: [C: +2] doc: fix reStructuredText formatting
[software/spicerack] - https://gerrit.wikimedia.org/r/492671 (owner: Volans)
[14:03:59] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:04:00] !log Stop MySQL on db1085 for mysql upgrade
[14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:44] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['thumbor1001.eqiad.wmnet'] `
[14:07:36] (PS1) Marostegui: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492674
[14:07:58] (Merged) jenkins-bot: doc: fix reStructuredText formatting [software/spicerack] - https://gerrit.wikimedia.org/r/492671 (owner: Volans)
[14:09:23] (CR) jenkins-bot: doc: fix reStructuredText formatting [software/spicerack] - https://gerrit.wikimedia.org/r/492671 (owner: Volans)
[14:13:08] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492674 (owner: Marostegui)
[14:14:41] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492674 (owner: Marostegui)
[14:15:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1085 after MySQL upgrade (duration: 00m 45s)
[14:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:53] (PS1) Filippo Giunchedi: [WIP] move varnish logging to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/492676 (https://phabricator.wikimedia.org/T213899)
[14:18:34] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492674 (owner: Marostegui)
[14:21:52] (PS2) Filippo
Giunchedi: [WIP] move varnish logging to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/492676 (https://phabricator.wikimedia.org/T213899)
[14:27:56] (PS3) Volans: remote: suppress Cumin's output [software/spicerack] - https://gerrit.wikimedia.org/r/481858 (https://phabricator.wikimedia.org/T212783)
[14:27:58] (PS1) Marostegui: db-eqiad.php: Pool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492678
[14:28:20] (PS4) Volans: remote: suppress Cumin's output [software/spicerack] - https://gerrit.wikimedia.org/r/481858 (https://phabricator.wikimedia.org/T212783)
[14:29:17] (CR) Volans: "This has been already used as a hotfix on cumin2001 during last Friday Elasticsearch upgrade cookbook run." [software/spicerack] - https://gerrit.wikimedia.org/r/481858 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[14:29:50] (CR) Marostegui: [C: +2] db-eqiad.php: Pool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492678 (owner: Marostegui)
[14:30:56] (Merged) jenkins-bot: db-eqiad.php: Pool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492678 (owner: Marostegui)
[14:30:58] jouncebot: next
[14:30:58] In 3 hour(s) and 29 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T1800)
[14:31:09] (CR) jenkins-bot: db-eqiad.php: Pool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492678 (owner: Marostegui)
[14:32:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool into API db1085 after MySQL upgrade (duration: 00m 45s)
[14:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:23] Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (MoritzMuehlenhoff)
[14:38:58] (CR) Fsero: "i've built it over the builder-envoy cloud VM.
Could anyone vote for or against that so i can move forward?" [debs/envoyproxy] (wikimedia-stretch) - https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: Fsero)
[14:39:59] <_joe_> fsero: I'll get to review it soon, promised
[14:40:16] thx _joe_
[14:43:32] (PS1) Marostegui: db-eqiad.php: Increase weight for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492680
[14:44:53] (PS13) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[14:46:05] (PS3) Filippo Giunchedi: [WIP] move varnish logging to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/492676 (https://phabricator.wikimedia.org/T213899)
[14:46:58] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[14:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:17] !log jiji@deploy1001 deploy aborted: (no justification provided) (duration: 00m 19s)
[14:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:48] (CR) Marostegui: [C: +2] db-eqiad.php: Increase weight for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492680 (owner: Marostegui)
[14:48:58] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492680 (owner: Marostegui)
[14:49:03] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[14:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:07] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[14:49:14] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492680 (owner: Marostegui)
[14:49:18] !log jiji@deploy1001 Finished deploy
[3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 15s)
[14:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1085 after MySQL upgrade (duration: 00m 45s)
[14:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:33] (PS14) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[14:53:50] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492683
[14:57:21] (PS15) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[14:57:58] (PS1) Andrew Bogott: apache: update ssl config [wikitech-static] - https://gerrit.wikimedia.org/r/492684
[14:58:00] (PS1) Andrew Bogott: mediawiki config: catch up with upstream mw changes [wikitech-static] - https://gerrit.wikimedia.org/r/492685
[14:59:06] (PS2) Andrew Bogott: apache: update ssl config [wikitech-static] - https://gerrit.wikimedia.org/r/492684
[14:59:08] (PS2) Andrew Bogott: mediawiki config: catch up with upstream mw changes [wikitech-static] - https://gerrit.wikimedia.org/r/492685
[15:01:49] (CR) jerkins-bot: [V: -1] Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026 (owner: Jbond)
[15:01:51] (CR) Andrew Bogott: "Daniel, I'm adding you on this patch just so you know that this repo exists :) It's slightly better than having the wikitech-static host " [wikitech-static] - https://gerrit.wikimedia.org/r/492684 (owner: Andrew Bogott)
[15:02:05] (CR) Ottomata: Add eventbus analytics logging alongside with kafka logging.
(3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: Ppchelko)
[15:02:36] (CR) Marostegui: [C: +2] db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492683 (owner: Marostegui)
[15:03:05] (CR) Giuseppe Lavagetto: [C: +1] "Some nitpick but overall a very good work." (6 comments) [debs/envoyproxy] (wikimedia-stretch) - https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: Fsero)
[15:03:58] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492683 (owner: Marostegui)
[15:04:13] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/492683 (owner: Marostegui)
[15:04:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1085 after MySQL upgrade (duration: 00m 45s)
[15:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:52] (PS1) Vgutierrez: netbox: Make use of the strong TLS ciphersuites configuration [puppet] - https://gerrit.wikimedia.org/r/492686 (https://phabricator.wikimedia.org/T217002)
[15:06:54] (PS1) Vgutierrez: librenms: Make use of the strong TLS ciphersuites configuration [puppet] - https://gerrit.wikimedia.org/r/492687 (https://phabricator.wikimedia.org/T217002)
[15:08:40] Operations, Traffic, HTTPS, Patch-For-Review: Make sure that services available for NDA-only users are using strong TLS ciphersuites - https://phabricator.wikimedia.org/T217002 (Vgutierrez)
[15:09:45] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/492686 (https://phabricator.wikimedia.org/T217002) (owner: Vgutierrez)
[15:10:25] (CR) Alexandros Kosiaris: [C: +2] Add citoid, cxserver kubernetes tokens [puppet] -
https://gerrit.wikimedia.org/r/492273 (https://phabricator.wikimedia.org/T213194) (owner: Alexandros Kosiaris)
[15:10:32] (PS2) Alexandros Kosiaris: Add citoid, cxserver kubernetes tokens [puppet] - https://gerrit.wikimedia.org/r/492273 (https://phabricator.wikimedia.org/T213194)
[15:10:35] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add citoid, cxserver kubernetes tokens [puppet] - https://gerrit.wikimedia.org/r/492273 (https://phabricator.wikimedia.org/T213194) (owner: Alexandros Kosiaris)
[15:14:27] (PS16) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[15:14:54] (CR) Volans: "[meta review, CR is still in WIP]" (4 comments) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: CRusnov)
[15:15:29] (PS1) Marostegui: db-eqiad.php: Increase db1085 API traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/492688
[15:16:32] (CR) Marostegui: [C: +2] db-eqiad.php: Increase db1085 API traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/492688 (owner: Marostegui)
[15:17:29] (Merged) jenkins-bot: db-eqiad.php: Increase db1085 API traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/492688 (owner: Marostegui)
[15:17:42] (CR) jenkins-bot: db-eqiad.php: Increase db1085 API traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/492688 (owner: Marostegui)
[15:18:13] (PS17) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[15:18:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase API traffic for db1085 after MySQL upgrade (duration: 00m 45s)
[15:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:52] !log shutting down certcentral VMs for decommission - T207389
[15:20:54] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:55] T207389: Rename the Certcentral project to Acme-chief - https://phabricator.wikimedia.org/T207389
[15:20:57] (CR) Hashar: "+1 on getting rid of print() statement, but maybe we can keep some kind of logging." (3 comments) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492647 (owner: Giuseppe Lavagetto)
[15:22:00] so long certcentral
[15:22:40] yeah... renaming and up&running project is great fun
[15:22:50] :-(
[15:23:04] s/and/an/
[15:23:32] involved commits so far: https://gerrit.wikimedia.org/r/q/topic:%22T207389%22+(status:open%20OR%20status:merged)
[15:24:43] :((
[15:26:44] (PS1) Marostegui: db-eqiad.php: Fully repool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492691
[15:27:05] (CR) Hashar: [C: +2] Do not try to pull/push if no registry is defined. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492648 (owner: Giuseppe Lavagetto)
[15:27:13] (PS5) Filippo Giunchedi: gerrit: stop sending logs directly to logstash [puppet] - https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899)
[15:27:15] (PS2) Filippo Giunchedi: logstash: remove log4j input [puppet] - https://gerrit.wikimedia.org/r/492640 (https://phabricator.wikimedia.org/T213899)
[15:27:17] (PS4) Filippo Giunchedi: [WIP] move varnish logging to logging pipeline [puppet] - https://gerrit.wikimedia.org/r/492676 (https://phabricator.wikimedia.org/T213899)
[15:27:19] (PS1) Filippo Giunchedi: logstash: disable syslog-tls input, unused [puppet] - https://gerrit.wikimedia.org/r/492692 (https://phabricator.wikimedia.org/T213899)
[15:27:51] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:54] !log jiji@deploy1001 deploy aborted: (no justification provided) (duration: 00m 04s)
[15:27:56] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:01] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[15:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:08] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 07s)
[15:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:29] jouncebot: now
[15:28:29] No deployments scheduled for the next 2 hour(s) and 31 minute(s)
[15:29:06] (PS18) Jbond: Add password reset function to ipmi module [software/spicerack] - https://gerrit.wikimedia.org/r/492026
[15:30:30] !log akosiaris@deploy1001 scap-helm citoid install -n staging -f citoid-staging-values.yaml stable/citoid [namespace: citoid, clusters: staging]
[15:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:32] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed
[15:30:32] !log akosiaris@deploy1001 scap-helm citoid finished
[15:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:53] <_joe_> akosiaris: oh we're deploying citoid to production?
[15:31:25] (CR) Herron: [C: +1] logstash: disable syslog-tls input, unused [puppet] - https://gerrit.wikimedia.org/r/492692 (https://phabricator.wikimedia.org/T213899) (owner: Filippo Giunchedi)
[15:33:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[15:33:41] (PS3) Jbond: update pdns-recursor pin on jessie [puppet] - https://gerrit.wikimedia.org/r/492635
[15:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:45] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[15:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:24] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[15:34:24] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[15:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:25] (CR) Jbond: [C: +2] update pdns-recursor pin on jessie [puppet] - https://gerrit.wikimedia.org/r/492635 (owner: Jbond)
[15:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:52] damn... my L8 is thick as a brick
[15:35:11] I'll go with the old fashioned way
[15:35:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[15:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:51] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[15:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:46] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1085 in API [mediawiki-config] - https://gerrit.wikimedia.org/r/492691 (owner: Marostegui)
[15:37:37] (CR) Hashar: "I don't think we need anymore threads.
That being said, I would love to have a way to monitor the various Gerrit pools :-/" [puppet] - 10https://gerrit.wikimedia.org/r/489475 (owner: 10Paladox) [15:37:45] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1085 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492691 (owner: 10Marostegui) [15:37:58] (03PS3) 10Fsero: Initial debianization for envoyproxy [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) [15:39:06] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1085 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492691 (owner: 10Marostegui) [15:39:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool in API db1085 after MySQL upgrade (duration: 00m 45s) [15:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:28] (03CR) 10Fsero: [C: 03+2] "fixed your nitpicks" (033 comments) [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/491951 (https://phabricator.wikimedia.org/T215810) (owner: 10Fsero) [15:41:19] (03PS1) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T214608) [15:43:08] (03PS1) 10Elukey: hadoop: allow yarn rmstore to be stored on HDFS [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492697 (https://phabricator.wikimedia.org/T216952) [15:43:19] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 (10herron) [15:43:50] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/14824/" [puppet] - 10https://gerrit.wikimedia.org/r/492692 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [15:44:00] (03PS2) 10Filippo Giunchedi: logstash: disable syslog-tls 
input, unused [puppet] - 10https://gerrit.wikimedia.org/r/492692 (https://phabricator.wikimedia.org/T213899) [15:44:02] (03PS2) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [15:45:08] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14825/an-master1001.eqiad.wmnet/ - looks only a re-shape of the file." [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492697 (https://phabricator.wikimedia.org/T216952) (owner: 10Elukey) [15:45:33] 10Operations, 10Patch-For-Review: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10herron) Setup of new hosts is complete. Tracking follow up steps in T213898 [15:45:39] 10Operations, 10Patch-For-Review: rack/setup/install logstash101[012].eqiad.wmnet - https://phabricator.wikimedia.org/T214608 (10herron) 05Open→03Resolved [15:46:02] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: disable syslog-tls input, unused [puppet] - 10https://gerrit.wikimedia.org/r/492692 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [15:49:43] herron: seems you upgrading logstash to stretch. Can we use the same opportunity to upgrade elasticsearch to 5.6.14? 
[15:50:47] (03PS2) 10Cwhite: prometheus: disable shipped node-exporter ipmitool and smartmon timers [puppet] - 10https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708) [15:53:23] hey onimisionipe, the upgrade to stretch is done now on the logstash hosts but can def help on the es upgrade [15:53:28] (03CR) 10Elukey: [C: 03+1] Install maven and ivysettings on all hadoop workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) (owner: 10Ottomata) [15:54:02] (03PS1) 10Vgutierrez: authdns: Drop certcentral hosts access (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/492705 (https://phabricator.wikimedia.org/T207389) [15:57:58] 10Operations, 10Proton, 10Core Platform Team Backlog (Watching / External), 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (10MoritzMuehlenhoff) >>! In T216493#4977440, @Tgr wrote: > There was some... [15:59:15] (03PS1) 10Mathew.onipe: nginx: move prometheus lua into lua dir [puppet/nginx] - 10https://gerrit.wikimedia.org/r/492711 (https://phabricator.wikimedia.org/T216681) [15:59:29] (03PS4) 10Mathew.onipe: tlsproxy: add prometheus option [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) [16:00:39] (03PS12) 10Eevans: Initial configuration for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [16:00:49] herron: we have to build plugins first. that's a long process (from what I heard) [16:01:13] the upgrade will have to come later this week. 
you can proceed with your setup [16:01:25] ok sounds good [16:01:39] onimisionipe herron FYI I was talking with gehel re: logstash upgrade last week before he went on vacation, seems like we can at least upgrade logstash to 5.6 for sure [16:03:09] godog: OK [16:03:15] (03CR) 10Ppchelko: [C: 03+1] "Should we put this on SWAT for today?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [16:04:28] (03PS1) 10Vgutierrez: authdns: Drop certcentral hosts access (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/492713 (https://phabricator.wikimedia.org/T207389) [16:05:08] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10jbond) [16:05:14] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) @fgiunchedi we can do this tomorrow if thats okay with you. Thanks. [16:05:14] godog: logstash upgrade still probably needs plugin upgrade. AFAIK, logstash checks the exact plugin version [16:05:22] 10Operations: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10jbond) [16:05:38] At least elasticsearch does, maybe logstash is more lax [16:05:48] * gehel is back to vacationing [16:06:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:06:28] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: icinga1001 crashed - https://phabricator.wikimedia.org/T214760 (10Cmjohnson) I forgot to update the task You have successfully submitted request SR986940908. 
[16:06:50] (03CR) 10Cwhite: [C: 03+2] prometheus: disable shipped node-exporter ipmitool and smartmon timers [puppet] - 10https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [16:06:57] (03PS3) 10Cwhite: prometheus: disable shipped node-exporter ipmitool and smartmon timers [puppet] - 10https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708) [16:07:11] gehel: indeed! thanks and enjoy your vacations :) [16:07:24] (03CR) 10Ottomata: "I guess! It would be nice to be able to test in beta :/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [16:07:27] (03CR) 10Eevans: [C: 03+1] "Update [PC output](https://puppet-compiler.wmflabs.org/compiler1002/14829/). This still requires that production secrets be generated in " [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) (owner: 10Eevans) [16:07:46] (03PS1) 10Giuseppe Lavagetto: sudo: use validate_cmd [puppet] - 10https://gerrit.wikimedia.org/r/492718 [16:07:58] (03PS13) 10Eevans: Initial configuration for session storage service [puppet] - 10https://gerrit.wikimedia.org/r/487885 (https://phabricator.wikimedia.org/T215883) [16:10:50] RECOVERY - ensure kvm processes are running on cloudvirt1009 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 [16:11:41] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks fairly happy: https://puppet-compiler.wmflabs.org/compiler1001/14827/" [puppet] - 10https://gerrit.wikimedia.org/r/492705 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:11:56] (03PS2) 10Vgutierrez: authdns: Drop certcentral hosts access (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/492705 (https://phabricator.wikimedia.org/T207389) [16:13:40] (03CR) 10Herron: [C: 03+1] "Looks good to me -- I think best to merge after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/492633/" [puppet] - 10https://gerrit.wikimedia.org/r/492640 
(https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [16:15:25] 10Operations, 10ops-eqiad, 10serviceops, 10HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (10Cmjohnson) A self-dispatch ticket has been created for a new DIMM and CPU You have successfully submitted request SR986941367. [16:16:02] (03CR) 10Effie Mouzeli: [C: 03+1] Scap: upgrade to 3.9.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/491879 (https://phabricator.wikimedia.org/T216666) (owner: 10Thcipriani) [16:16:37] (03CR) 10Herron: [C: 03+1] gerrit: stop sending logs directly to logstash [puppet] - 10https://gerrit.wikimedia.org/r/492633 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [16:16:46] (03PS1) 10Mathew.onipe: thumbor: refer prometheus.lua from updated location [puppet] - 10https://gerrit.wikimedia.org/r/492720 (https://phabricator.wikimedia.org/T216681) [16:17:31] !log upload envoy 1.9.0 to stretch-wikimedia T215810 [16:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:35] T215810: Package envoy 1.9.0 for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 [16:19:09] (03CR) 10Vgutierrez: [C: 03+2] "pcc shows the expected clean up of the puppet catalog: https://puppet-compiler.wmflabs.org/compiler1002/14835/" [puppet] - 10https://gerrit.wikimedia.org/r/492713 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:19:23] (03PS2) 10Vgutierrez: authdns: Drop certcentral hosts access (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/492713 (https://phabricator.wikimedia.org/T207389) [16:20:11] (03PS5) 10Mathew.onipe: tlsproxy: add prometheus option [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) [16:21:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/492718 (owner: 10Giuseppe Lavagetto) [16:21:45] (03CR) 10Mathew.onipe: tlsproxy: 
add prometheus option (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681) (owner: 10Mathew.onipe) [16:21:58] !log Depooling and reimaging thumbor2001 - T214597 [16:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:01] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [16:22:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudvirt1015: apparent hardware errors in CPU/Memory - https://phabricator.wikimedia.org/T215012 (10Cmjohnson) Requested a new CPU from Dell You have successfully submitted request SR986941687. [16:23:48] !log reset 2fa for JBennett on phab with video confirmation [16:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` ['thumbor2001.codfw.wmnet'] ` The log can be foun... [16:30:23] (03PS3) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:30:46] (03PS4) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:32:11] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Cmjohnson) This will most like need a new motherboard. I requested one through Dell You have successfully submitted request SR986942076. [16:32:48] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1114 crashed (HW memory issues) - https://phabricator.wikimedia.org/T214720 (10Marostegui) Excellent! Thank you Chris! 
[16:33:10] (03PS5) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:34:36] (03CR) 10Effie Mouzeli: [C: 03+2] Scap: upgrade to 3.9.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/491879 (https://phabricator.wikimedia.org/T216666) (owner: 10Thcipriani) [16:34:40] (03CR) 10Fsero: [C: 03+1] "nice one :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/492667 (owner: 10Giuseppe Lavagetto) [16:34:49] (03PS2) 10Effie Mouzeli: Scap: upgrade to 3.9.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/491879 (https://phabricator.wikimedia.org/T216666) (owner: 10Thcipriani) [16:34:51] (03PS1) 10Volans: sre.hosts.decommission: don't fail on missing mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/492722 [16:37:42] (03CR) 10Vgutierrez: [C: 03+1] "LGTM! Thanks for the quick fix :D" [cookbooks] - 10https://gerrit.wikimedia.org/r/492722 (owner: 10Volans) [16:38:48] (03CR) 10Ayounsi: [C: 03+1] librenms: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492687 (https://phabricator.wikimedia.org/T217002) (owner: 10Vgutierrez) [16:38:50] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: don't fail on missing mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/492722 (owner: 10Volans) [16:40:21] (03PS6) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:40:26] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:30] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 04s) [16:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:18] (03Merged) 10jenkins-bot: 
sre.hosts.decommission: don't fail on missing mgmt [cookbooks] - 10https://gerrit.wikimedia.org/r/492722 (owner: 10Volans) [16:41:24] (03PS1) 10Vgutierrez: install_server: Get rid of certcentral instances [puppet] - 10https://gerrit.wikimedia.org/r/492724 (https://phabricator.wikimedia.org/T207389) [16:41:55] !log Pooling thumbor1001 [16:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:33] (03CR) 10Ayounsi: [C: 03+1] netbox: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492686 (https://phabricator.wikimedia.org/T217002) (owner: 10Vgutierrez) [16:43:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [16:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:43] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [16:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:19] volans: :_( [16:46:38] (03CR) 10Vgutierrez: [C: 03+2] netbox: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492686 (https://phabricator.wikimedia.org/T217002) (owner: 10Vgutierrez) [16:46:56] (03PS2) 10Vgutierrez: netbox: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492686 (https://phabricator.wikimedia.org/T217002) [16:48:25] !log thcipriani@deploy1001 Synchronized README: noop sync for scap 3.9.0-1 (duration: 00m 46s) [16:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:58] (03PS1) 10Volans: sre.hosts.decommission: fix catched exception [cookbooks] - 10https://gerrit.wikimedia.org/r/492728 [16:49:05] vgutierrez: ooops, old catched exception, fix here ^^^ [16:49:15] 10Operations: Audit our puppet tree for uses of jessie-backports - https://phabricator.wikimedia.org/T216711 (10bd808) [16:50:33] PROBLEM - puppet last run on elastic2034 is 
CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [16:50:51] (03CR) 10Vgutierrez: [C: 03+2] librenms: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492687 (https://phabricator.wikimedia.org/T217002) (owner: 10Vgutierrez) [16:51:01] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[scap] [16:51:06] (03PS2) 10Vgutierrez: librenms: Make use of the strong TLS ciphersuites configuration [puppet] - 10https://gerrit.wikimedia.org/r/492687 (https://phabricator.wikimedia.org/T217002) [16:51:35] _joe_ jijiki: scap 3.9.0-1 upgrade looks normal from a scap log perspective. Thanks for the update. (although icinga isn't happy on some instances it seems) [16:52:04] thcipriani: race condition with puppet cron running the apt-get update [16:52:24] volans: ah, makes sense, thanks. [16:52:45] thcipriani: you can check the upgrade here ;) https://debmonitor.wikimedia.org/packages/scap [16:53:59] volans: TIL! that's really cool. [16:54:10] thcipriani: actually scratch what I just said... it was the puppet client failing to install scap=3.9.0-1 [16:54:18] oh good. [16:54:20] because unable to get the lock [16:54:41] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Make sure that services available for NDA-only users are using strong TLS ciphersuites - https://phabricator.wikimedia.org/T217002 (10Vgutierrez) [16:54:59] (03PS7) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:54:59] moritzm, jbond42: any fleet-wide debdeploy upgrade ongoing by any chance? 
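
[editor's note] The Package[scap] failures above were attributed to the puppet client being unable to get the package lock while another upgrade was running. A minimal sketch of that kind of contention follows. It assumes Linux and uses flock on a throwaway temp file as a stand-in; it is not the actual apt/dpkg locking code, and the names try_lock and lock_with_retry are hypothetical.

```python
import fcntl
import os
import tempfile
import time


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is busy."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def lock_with_retry(path, attempts=3, delay=0.01):
    """Try a few times to take the lock; give up and return None if still busy."""
    for _ in range(attempts):
        handle = try_lock(path)
        if handle is not None:
            return handle
        time.sleep(delay)
    return None


# Hypothetical stand-in for the package manager lock file.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

first = try_lock(lock_path)                      # e.g. the ongoing upgrade
second_while_busy = lock_with_retry(lock_path)   # e.g. the puppet run: gives up
first.close()                                    # closing the file releases the lock
second_after_release = lock_with_retry(lock_path)  # a later run succeeds
```

This mirrors what the log shows: the run that overlaps with the other upgrade fails, and a later run recovers on its own once the lock holder has finished.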
[16:55:09] s/fleet/mw*/ [16:55:33] volans: sorry yes i should have logged, i was upgrading cups [16:55:40] its complet now [16:55:55] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Deploy scap 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (10thcipriani) 05Open→03Resolved Thanks for the merge @jijiki ! [16:55:59] akc, no problem, that explains it :) [16:57:35] volans: nope, go ahead [16:57:43] (03CR) 10CRusnov: "> Patch Set 3:" (033 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/492007 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [16:58:05] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14841/" [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [16:59:20] (03PS8) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [17:10:06] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/492728 (owner: 10Volans) [17:11:21] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:15:51] RECOVERY - puppet last run on elastic2034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:16:56] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix catched exception [cookbooks] - 10https://gerrit.wikimedia.org/r/492728 (owner: 10Volans) [17:18:42] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix catched exception [cookbooks] - 10https://gerrit.wikimedia.org/r/492728 (owner: 10Volans) [17:22:45] (03PS1) 10Alexandros Kosiaris: mathoid: Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 [17:26:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [17:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:50] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [17:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:58] damn... what now? :) [17:27:00] :( [17:27:31] nice question... cause 2001 went as expected, not so far for 1001 [17:27:48] actually that's not true... [17:27:52] https://www.irccloud.com/pastebin/BZyLAxuM/ [17:31:14] ok so it actually run all but failed to update the task [17:32:20] orh right, {owner} and user=, refactor fail :( [17:36:00] (03CR) 10GTirloni: "Yes, that's right. And there's no Shinken in Stretch." [puppet] - 10https://gerrit.wikimedia.org/r/491460 (owner: 10Muehlenhoff) [17:37:39] (03PS1) 10Volans: sre.hosts.decommission: fix typo in string format [cookbooks] - 10https://gerrit.wikimedia.org/r/492737 [17:38:16] (03PS2) 10Alexandros Kosiaris: mathoid: Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 [17:38:24] (03CR) 10Vgutierrez: [C: 03+1] "LGTM! 
last one hopefully :D" [cookbooks] - 10https://gerrit.wikimedia.org/r/492737 (owner: 10Volans) [17:39:26] (03PS3) 10Alexandros Kosiaris: Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 [17:39:42] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix typo in string format [cookbooks] - 10https://gerrit.wikimedia.org/r/492737 (owner: 10Volans) [17:41:06] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix typo in string format [cookbooks] - 10https://gerrit.wikimedia.org/r/492737 (owner: 10Volans) [17:42:44] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [17:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:54] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [17:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:07] vgutierrez: thanks for testing it again, let's prove it's idempotent :D [17:43:18] yeah.. it fails every time [17:43:44] so.. two issues [17:43:48] Host %s already missing on Debmonitor [17:43:52] that %s doesn't look right [17:44:04] and... 
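
[editor's note] Two of the cookbook failures discussed above look like string formatting slips: a template using {owner} while the call passed user=, and a log line printing a literal %s. A minimal sketch of both in Python follows. The function names, the template text, and the example hostname are hypothetical; this is not the actual cookbook or spicerack code.

```python
import io
import logging


def render_buggy(user):
    # Placeholder name and keyword argument disagree: raises KeyError('owner').
    return "Host decommissioned by {owner}".format(user=user)


def render_fixed(user):
    # Placeholder name and keyword argument match.
    return "Host decommissioned by {owner}".format(owner=user)


def make_logger(stream):
    """Build an isolated logger that writes bare messages to the given stream."""
    logger = logging.getLogger("decommission_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = make_logger(stream)
hostname = "example1001.eqiad.wmnet"  # hypothetical host

# Missing argument: logging only interpolates when args are given,
# so the %s is emitted as-is.
logger.info("Host %s already missing on Debmonitor")
# Fixed: the hostname is passed as the argument for %s.
logger.info("Host %s already missing on Debmonitor", hostname)

lines = stream.getvalue().splitlines()

try:
    render_buggy("someone")
    format_error = None
except KeyError as exc:
    format_error = exc.args[0]
```

The format mismatch fails loudly with a KeyError, while the logging slip is silent and only shows up when someone reads the output, which fits the order in which the two problems were noticed in the channel.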
[17:44:10] https://www.irccloud.com/pastebin/dgI8STux/ [17:47:49] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber): Build blubber file for ORES - https://phabricator.wikimedia.org/T210268 (10thcipriani) [17:50:28] (03PS1) 10Volans: sre.hosts.decommission: fix typo in argument name [cookbooks] - 10https://gerrit.wikimedia.org/r/492741 [17:50:32] vgutierrez: first fix ^^^ [17:50:45] the other is in spicerack, will follow shorthly [17:51:17] (03CR) 10Vgutierrez: [C: 03+1] sre.hosts.decommission: fix typo in argument name [cookbooks] - 10https://gerrit.wikimedia.org/r/492741 (owner: 10Volans) [17:52:22] 10Operations, 10SRE-Access-Requests: Allow wmcs-roots full sudo on wmcs non-replica dbs - https://phabricator.wikimedia.org/T217065 (10bd808) [17:52:38] (03PS2) 10BryanDavis: wmcs: Allow wmcs-roots full sudo on wmcs non-replica dbs [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) [17:52:43] (03CR) 10RobH: "Please note that this was provisionally approved in the SRE weekly meeting. The only blocker is filing a #sre-access-request in phabricat" [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [17:53:00] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10Cmjohnson) @ayounsi I want to do all the server moves on Thursday this week. Can you ask the service owners to have everything depooled. I will get started at 1500 UTC. The server move will t... [17:53:06] i like the new gerrit ui so much better than old. 
[17:53:50] (03CR) 10Vgutierrez: [C: 03+2] install_server: Get rid of certcentral instances [puppet] - 10https://gerrit.wikimedia.org/r/492724 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [17:53:58] robh and it's changed again :) [17:54:12] huzzah [17:54:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Allow wmcs-roots full sudo on wmcs non-replica dbs - https://phabricator.wikimedia.org/T217065 (10bd808) a:03RobH [17:54:26] ok, reviewing it now! [17:54:39] (03PS2) 10Vgutierrez: install_server: Get rid of certcentral instances [puppet] - 10https://gerrit.wikimedia.org/r/492724 (https://phabricator.wikimedia.org/T207389) [17:54:41] (03CR) 10RobH: [C: 03+2] wmcs: Allow wmcs-roots full sudo on wmcs non-replica dbs [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [17:54:41] robh something like https://gerrit-review.googlesource.com/c/gerrit/+/214432 :) [17:54:47] (03PS3) 10RobH: wmcs: Allow wmcs-roots full sudo on wmcs non-replica dbs [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [17:54:54] (03PS1) 10Volans: debmonitor: fix missing variable for logging line [software/spicerack] - 10https://gerrit.wikimedia.org/r/492743 [17:55:00] vgutierrez: ^^^ [17:55:13] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: fix typo in argument name [cookbooks] - 10https://gerrit.wikimedia.org/r/492741 (owner: 10Volans) [17:55:16] paladox: yeah, but even the newest is nicer than the old stuff [17:55:23] yup [17:55:25] i just like the unified review screens a lot better than the old unified screens [17:56:18] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Allow wmcs-roots full sudo on wmcs non-replica dbs - https://phabricator.wikimedia.org/T217065 (10jcrespo) a:05RobH→03jcrespo There is one possibility of privilege escalation, labsdb used to have a shared root password with production. This was cor... 
[17:56:34] (03PS4) 10RobH: wmcs: Allow wmcs-roots full sudo on wmcs non-replica dbs [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [17:56:42] (03CR) 10Vgutierrez: [C: 03+1] debmonitor: fix missing variable for logging line [software/spicerack] - 10https://gerrit.wikimedia.org/r/492743 (owner: 10Volans) [17:56:44] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix typo in argument name [cookbooks] - 10https://gerrit.wikimedia.org/r/492741 (owner: 10Volans) [17:56:55] volans: thx for the quick fixes :D [17:57:05] thank you for the testing [17:57:35] I'm resisting the urge of quoting Pulp Fiction now [17:57:39] as the spicerack one will need a new release and deploy, in case you wanna retry one last time I guess you can go ahead also without the fixed logging [17:57:50] sure, one sec [17:58:02] (run puppet first ofc) [17:58:07] as I didn't [17:58:08] already on it.. [17:58:08] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10fgiunchedi) [17:58:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [17:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [17:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:40] \o/ [17:58:43] \o/ [17:58:47] (03CR) 10Ottomata: Add eventbus analytics logging alongside with kafka logging. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [17:58:51] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Allow wmcs-roots full sudo on wmcs non-replica dbs - https://phabricator.wikimedia.org/T217065 (10RobH) That is fine by me (Jaime also pinged me in irc to ensure I saw this, which is GREATLY appreciated since I hadn't seen the above comment!) This is... [17:58:58] did you pass a task to it? [17:59:03] * vgutierrez dances around the laptop [17:59:06] ahahah [17:59:28] volans: of course: https://phabricator.wikimedia.org/T207389#4981875 [17:59:30] lovely comment <3 [17:59:35] !log Depooling and reimaging thumbor1002 to stretch - T214597 [17:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:38] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 [18:00:03] nice! :D [18:00:04] gehel and onimisionipe: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T1800). [18:00:33] commanded. really? [18:00:57] lol [18:01:00] haha wth [18:02:04] (03CR) 10Alexandros Kosiaris: [C: 03+1] Do not try to pull/push if no registry is defined. [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/492648 (owner: 10Giuseppe Lavagetto) [18:04:06] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@4c27682]: New GUI, Updater & Blazegraph builds [18:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:23] (03CR) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [18:05:25] (03PS1) 10Vgutierrez: certcentral: wipe certcentral from our puppet codebase [puppet] - 10https://gerrit.wikimedia.org/r/492744 (https://phabricator.wikimedia.org/T207389) [18:07:23] (03CR) 10Vgutierrez: [C: 03+1] "LGTM for me. Alex Monk, could you take care of certcentral mentions in hieradata/labs/deployment-prep/common.yaml and let me know when thi" [puppet] - 10https://gerrit.wikimedia.org/r/492744 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [18:08:00] Krenair: ^^ take a look to that whenever possible please :D [18:08:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] Improve logging of errors, remove spurious print statements (032 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/492647 (owner: 10Giuseppe Lavagetto) [18:09:09] (03CR) 10Volans: [C: 03+2] debmonitor: fix missing variable for logging line [software/spicerack] - 10https://gerrit.wikimedia.org/r/492743 (owner: 10Volans) [18:10:42] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2001.codfw.wmnet'] ` and were **ALL** successful. [18:13:09] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [18:13:11] (03CR) 10Jcrespo: [C: 03+1] "I reviewed existing root passwords on the database, it needed a small correction, it is now ok to go on." 
[puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [18:13:25] (03Merged) 10jenkins-bot: debmonitor: fix missing variable for logging line [software/spicerack] - 10https://gerrit.wikimedia.org/r/492743 (owner: 10Volans) [18:13:59] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@4c27682]: New GUI, Updater & Blazegraph builds (duration: 09m 53s) [18:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:09] (03CR) 10RobH: [C: 03+2] wmcs: Allow wmcs-roots full sudo on wmcs non-replica dbs [puppet] - 10https://gerrit.wikimedia.org/r/490948 (https://phabricator.wikimedia.org/T217065) (owner: 10BryanDavis) [18:14:26] (03CR) 10jenkins-bot: debmonitor: fix missing variable for logging line [software/spicerack] - 10https://gerrit.wikimedia.org/r/492743 (owner: 10Volans) [18:15:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Allow wmcs-roots full sudo on wmcs non-replica dbs - https://phabricator.wikimedia.org/T217065 (10RobH) 05Open→03Resolved a:05jcrespo→03None IRC sync update: Jaime put a +2 on patch and let me know this was good to go via irc. Patch is now liv... 
[18:16:54] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided) [18:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:03] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10crusnov) [18:18:04] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 01m 09s) [18:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:34] !log Pooling thumbor2001 [18:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:52] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10jijiki) [18:21:34] * Krinkle preparing to stage on mwdebug1002 [18:25:44] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` thumbor1002.eqiad.wmnet ` The log can be found in... 
[18:29:26] PROBLEM - nutcracker port on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:29:36] PROBLEM - MD RAID on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:29:36] PROBLEM - configured eth on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:29:44] PROBLEM - Memcached on thumbor1002 is CRITICAL: connect to address 10.64.16.57 and port 11211: Connection refused [18:29:48] PROBLEM - Disk space on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:29:52] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:02] PROBLEM - Check size of conntrack table on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:08] PROBLEM - DPKG on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:14] PROBLEM - haproxy process on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:16] PROBLEM - dhclient process on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:16] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:26] PROBLEM - Check whether ferm is active by checking the default input chain on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:30] PROBLEM - haproxy alive on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:30:36] ah this is a reimage [18:31:28] PROBLEM - puppet last run on thumbor1002 is CRITICAL: connect to address 10.64.16.57 port 5666: Connection refused [18:32:19] 10Operations, 10Elasticsearch, 10Traffic, 10Discovery-Search (Current work), 10Patch-For-Review: Enable nginx prometheus metrics for all elastic 
nodes - https://phabricator.wikimedia.org/T216681 (10Mathew.onipe) [18:33:33] my bad sorryyy [18:33:48] damn I knew I was missing something [18:36:40] elukey: staging now on mwdebug1002 to make sure it all works fine. Waiting for your signal before rolling out. [18:38:20] Krinkle: green light from me, whenever you want [18:40:20] vgutierrez, I think the deployment-prep certcentral stuff is dead [18:40:33] I still haven't gotten around for writing designate integration code [18:41:07] (03CR) 10Alex Monk: "20<Krenair>30 vgutierrez, I think the deployment-prep certcentral stuff is dead" [puppet] - 10https://gerrit.wikimedia.org/r/492744 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [18:41:11] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Mholloway) @MoritzMuehlenhoff It's an external blocker. We're blocke... [18:43:26] (03PS6) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [18:43:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [18:44:53] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.18/includes/libs/objectcache/WANObjectCache.php: 79a1593cae48 / T203786 (duration: 00m 48s) [18:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:58] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [18:45:30] Krenair: ok.. I just wanted to avoid messing with anything you could have running there [18:46:05] elukey: done [18:47:08] Krinkle: thanks! 
watching metrics now [18:51:14] Krinkle: I'll follow up in the task tomorrow with all the metrics, let's hope this is the last one! :) [19:00:05] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T1900) [19:00:05] Smalyshev, bmansurov, Zoranzoki21, Pchelolo, tgr, and stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:08] here [19:01:10] I've actually removed myself from SWAT, more testing needed. [19:01:16] (03PS2) 10Smalyshev: [BETA] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490647 (owner: 10Jforrester) [19:01:37] won't be around during part of the SWAT but the patch cannot be tested anyway (I'll test it later by running the rename repair script, but that takes long) [19:02:02] here [19:02:03] so can be merged without oversight IMO [19:03:10] * Krinkle staging on deploy1001 [19:03:10] hi [19:03:15] SMalyshev: give me a few minutes. [19:03:35] sure, no rush... I'm in the meetings anyway so if somebody wants to go before me it's fine [19:03:53] meeting running longer than expected... [19:05:26] Krinkle: btw, this is for beta... is beta also served from deploy1001? [19:05:33] SMalyshev: nope. [19:05:35] So go ahead. 
[19:05:37] I thought there's some parallel thing [19:05:51] just don't do any git pull or sync on deploy1001 prod [19:06:19] ok give me a minute [19:06:58] RECOVERY - Memcached on thumbor1002 is OK: TCP OK - 0.037 second response time on 10.64.16.57 port 11211 [19:07:20] RECOVERY - Check size of conntrack table on thumbor1002 is OK: OK: nf_conntrack is 0 % full [19:07:30] RECOVERY - DPKG on thumbor1002 is OK: All packages OK [19:07:34] RECOVERY - haproxy process on thumbor1002 is OK: PROCS OK: 2 processes with command name haproxy [19:07:36] RECOVERY - Check whether ferm is active by checking the default input chain on thumbor1002 is OK: OK ferm input default policy is set [19:07:38] SMalyshev: beta is pulled from master by puppet, but you still need to pull it to deploy1001 otherwise the next deployer will be confused [19:07:42] RECOVERY - nutcracker port on thumbor1002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [19:07:53] tgr: ok [19:07:57] Pchelolo: ^^ are we too late? [19:08:00] .. after I'm done :) [19:08:06] OK. I'm done now. [19:08:10] * Krinkle unlocks mwdebug1002 [19:09:30] RECOVERY - Disk space on thumbor1002 is OK: DISK OK [19:09:36] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [19:09:40] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [19:10:22] RECOVERY - configured eth on thumbor1002 is OK: OK - interfaces up [19:10:24] RECOVERY - MD RAID on thumbor1002 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:10:59] btw, petr and I would like to do our SWAT after all [19:11:03] adding us back in the wiki [19:11:05] we are here [19:12:38] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) As of 2019-02-25 @ 19:12 there are no memory errors logged post dimm slot swap. 
[19:12:41] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) As of 2019-02-25 @ 19:12 there are no memory errors logged post dimm slot swap. [19:14:30] (03PS7) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [19:14:53] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:15:04] SMalyshev: who's running the swat? (sorry haven't really done many swats before) [19:15:19] Krinkle I think [19:15:19] RECOVERY - haproxy alive on thumbor1002 is OK: OK check_alive uptime 344s [19:15:23] ah [19:15:55] K Krinkle i put ours back on the bottom of the list. I'd like to just test one thing once it is on a mwdebug server [19:22:12] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) 05Stalled→03Open Unstalling as the RFC requested above is now approved! (see: T213318) How are we going to proceed with... [19:25:41] how goes swat? should I check in with someone? [19:26:28] (03PS8) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [19:26:54] I think Krinkle was just scratching his own itch ;) [19:26:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:27:15] it's buseh [19:27:23] k... 
[19:27:26] Yeah, go ahead, not me :) [19:27:30] (03CR) 10Reedy: [C: 03+2] [BETA] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490647 (owner: 10Jforrester) [19:27:31] oh i can do mine? [19:27:36] oh e.g. mwdebug1001? [19:28:33] * Reedy starts merging stuff [19:28:38] (03Merged) 10jenkins-bot: [BETA] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490647 (owner: 10Jforrester) [19:28:55] (03CR) 10jenkins-bot: [BETA] Enable WikibaseCirrusSearch on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490647 (owner: 10Jforrester) [19:29:06] (03PS3) 10Reedy: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:29:09] (03PS9) 10Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) [19:29:38] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:29:57] 10Operations, 10Operations-Software-Development, 10serviceops, 10User-Joe, 10User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10crusnov) Is it okay to use rapi for this or is there a compelling reason to use cumin+ganeti-* commands? [19:30:29] (03CR) 10Reedy: Enable logging for CitationUsage and CitationUsagePageLoad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:31:39] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: labs! 
(duration: 00m 46s) [19:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:49] (03CR) 10Bmansurov: Enable logging for CitationUsage and CitationUsagePageLoad (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:32:19] Reedy: ^ [19:33:08] I know that *should* work, but does it actually? [19:33:16] We tend to set defaults for ease/simplicity [19:33:33] Reedy: it worked the last time around. Let me pull the patch. [19:33:34] (03CR) 10Jcrespo: "Probably dump_section.py should be renamed to backup_db.py (as it can be used for more things than dumps) and recover_section.py to recove" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [19:33:48] Thanks. If we've got an example in IS of it being fine [19:34:44] Krinkle: so I see it's still not deployed on beta commons... probably takes time? [19:34:58] Reedy: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/461117/3/wmf-config/InitialiseSettings.php [19:34:59] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.18/extensions/Renameuser: T215107 (duration: 00m 46s) [19:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:03] T215107: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107 [19:35:25] SMalyshev: Should be there within 15 minutes on beta usually [19:35:47] Reedy: ok thanks, will be patiently waiting :) [19:36:30] (03CR) 10Reedy: [C: 03+2] Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:36:44] I think jerkins should comment when it's deployed [19:37:17] (03CR) 10Vgutierrez: [C: 03+2] Make lvs5003 peer with cr2-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/490525
(https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [19:37:20] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/239254/console [19:37:22] It's scapping [19:37:27] (03Merged) 10jenkins-bot: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:37:30] (03PS3) 10Vgutierrez: Make lvs5003 peer with cr2-eqsin [puppet] - 10https://gerrit.wikimedia.org/r/490525 (https://phabricator.wikimedia.org/T213121) (owner: 10Ayounsi) [19:38:46] (03CR) 10jenkins-bot: Enable logging for CitationUsage and CitationUsagePageLoad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492344 (https://phabricator.wikimedia.org/T213969) (owner: 10Bmansurov) [19:39:47] Reedy: when we get to mine, can we put it on mwdebug1001 for just a sec? want to just make an edit and verify events. [19:40:27] (03PS3) 10Reedy: Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863) (owner: 10Zoranzoki21) [19:40:48] (03CR) 10Reedy: [C: 03+2] Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863) (owner: 10Zoranzoki21) [19:40:56] (03PS1) 10Cwhite: admin: add Delphine Menard to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/492754 (https://phabricator.wikimedia.org/T216120) [19:41:51] !log restarting pybal on lvs5003 - T213121 [19:41:54] Aye [19:41:54] (03Merged) 10jenkins-bot: Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863) (owner: 10Zoranzoki21) [19:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:54] T213121: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 [19:42:07] (03PS2) 10Reedy: Disable 
MFSpecialCaseMainPage for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865) (owner: 10Zoranzoki21) [19:42:11] (03CR) 10Reedy: [C: 03+2] Disable MFSpecialCaseMainPage for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865) (owner: 10Zoranzoki21) [19:42:24] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10mobrovac) >>! In T216521#4982048, @Mholloway wrote: > @MSantos As a... [19:43:16] (03Merged) 10jenkins-bot: Disable MFSpecialCaseMainPage for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865) (owner: 10Zoranzoki21) [19:43:54] Reedy: is the patch ready for testing in any of the mwdebug servers? [19:44:08] (03PS5) 10Reedy: GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:44:13] (03CR) 10Reedy: [C: 03+2] GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:45:21] (03Merged) 10jenkins-bot: GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:46:44] !log reedy@deploy1001 Synchronized dblists/mobilemainpagelegacy.dblist: Disable MFSpecialCaseMainPage for srwiki and enwikivoyage (duration: 00m 46s) [19:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:02] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Thumbor upgrade to stretch plan - 
https://phabricator.wikimedia.org/T214597 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1002.eqiad.wmnet'] ` and were **ALL** successful. [19:47:53] bmansurov: mwdebug1002 [19:48:25] RECOVERY - puppet last run on thumbor1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:48:31] stephanebisson: ^ if you want to test too [19:49:06] Reedy: Yep, on mwdebug1002? [19:49:12] Yup [19:49:19] testing now... [19:50:37] (03CR) 10jenkins-bot: Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863) (owner: 10Zoranzoki21) [19:50:40] (03CR) 10jenkins-bot: Disable MFSpecialCaseMainPage for srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865) (owner: 10Zoranzoki21) [19:50:41] (03CR) 10jenkins-bot: GrowthExperiments: Soft launch of help panel on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489729 (https://phabricator.wikimedia.org/T215666) (owner: 10Kosta Harlan) [19:51:16] 10Operations, 10ops-eqiad, 10ops-eqsin, 10netops, 10Patch-For-Review: Deploy cr2-eqsin - https://phabricator.wikimedia.org/T213121 (10ayounsi) [19:51:34] Reedy: works as expected [19:51:48] SMalyshev: ^ Based on that jenkins spam, your patch should be on beta [19:51:51] stephanebisson: cheers [19:52:00] Reedy: it's working. Please deploy everywhere. Thanks! [19:52:03] sweet [19:52:24] (03PS2) 10Reedy: Revert "Revert "Use EventBus multi endpoint configuration for eventbus configs"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [19:53:04] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Swat! 
(duration: 00m 45s) [19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:17] (03CR) 10Reedy: [C: 03+2] Revert "Revert "Use EventBus multi endpoint configuration for eventbus configs"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [19:54:15] (03Merged) 10jenkins-bot: Revert "Revert "Use EventBus multi endpoint configuration for eventbus configs"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [19:54:44] Reedy: yep it's on beta [19:55:09] doesn't look like anything is broken... [19:55:32] Reedy: thanks [19:55:56] ottomata: It's on mwdebug1002 [19:56:06] k.. [19:56:43] Reedy: thanks, all is well. [19:56:45] proceed! [19:57:17] woohooo! :) [19:58:41] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Use EventBus multi endpoint configuration for eventbus configs (duration: 00m 45s) [19:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:59] marostegui: any concerns about restarting the Photographer rename now that the patch has been deployed? [20:02:43] (03CR) 10jenkins-bot: Revert "Revert "Use EventBus multi endpoint configuration for eventbus configs"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492356 (owner: 10Ottomata) [20:04:50] bd808: is https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/491819/ ok to merge? [20:05:07] related to https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/491820/ [20:07:24] ottomata: I haven't tested it, but if you have and are pretty confident it works then merge away ;) [20:07:31] ok thanks [20:07:56] ottomata: I guess the one question I have is if we should just be updating all nodejs in mediawiki-vagrant to v10?
[20:08:06] probably should eventually [20:08:07] ya [20:08:32] (03CR) 10Herron: [C: 03+1] admin: add Delphine Menard to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/492754 (https://phabricator.wikimedia.org/T216120) (owner: 10Cwhite) [20:08:43] eventually - yes, but right now, for example maps role will not work with node 10 [20:09:21] Pchelolo: *nod* good to know. I'll leave nodejs things in the capable hands of those of you who actually know how to work with it ;) [20:09:43] (03CR) 10Cwhite: [C: 03+2] admin: add Delphine Menard to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/492754 (https://phabricator.wikimedia.org/T216120) (owner: 10Cwhite) [20:09:49] (03PS2) 10Cwhite: admin: add Delphine Menard to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/492754 (https://phabricator.wikimedia.org/T216120) [20:14:40] (03CR) 10BBlack: [C: 03+1] "Seems to do the right thing, and passes manual VTC testing." [puppet] - 10https://gerrit.wikimedia.org/r/490120 (https://phabricator.wikimedia.org/T212197) (owner: 10Dr0ptp4kt) [20:16:41] Q for mw folk: [20:16:43] Uncaught exception 'ConfigException' with message 'MultiConfig::get: undefined option: 'EventServices' [20:16:48] getting that in CI test for a patch [20:16:54] (03PS1) 10Andrew Bogott: labstores: set check_disk_critical: true and profile::base::notifications: critical [puppet] - 10https://gerrit.wikimedia.org/r/492761 (https://phabricator.wikimedia.org/T217068) [20:16:57] is there a special place I need to define my new config? [20:17:03] in order for CI test to pass?
[20:17:49] (03PS8) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [20:17:51] (03CR) 10jerkins-bot: [V: 04-1] labstores: set check_disk_critical: true and profile::base::notifications: critical [puppet] - 10https://gerrit.wikimedia.org/r/492761 (https://phabricator.wikimedia.org/T217068) (owner: 10Andrew Bogott) [20:17:55] 10Operations, 10ExternalGuidance, 10Traffic, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10BBlack) The VCL looks good, please give us some notice (~24h would be ideal?) on when you n... [20:19:26] (03PS2) 10Andrew Bogott: labstores: make failures on these hosts page more [puppet] - 10https://gerrit.wikimedia.org/r/492761 (https://phabricator.wikimedia.org/T217068) [20:21:21] (03PS4) 10Alexandros Kosiaris: Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 [20:32:45] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (10Mholloway) Another alternative would be to switch to using the node-... [20:38:57] PROBLEM - SSH on bast5001 is CRITICAL: Server answer [20:40:11] RECOVERY - SSH on bast5001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) [20:52:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "nice, a few inline comments" (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/492667 (owner: 10Giuseppe Lavagetto) [20:58:05] 10Operations, 10Operations-Software-Development, 10serviceops, 10User-Joe, 10User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10akosiaris) >>! 
In T203963#4982270, @crusnov wrote: > Is it okay to use rapi for this or is there a compelling reason to use cumin+g... [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T2100). [21:05:31] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@1ac3c38]: Update mobileapps to c3871cc [21:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:20] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@1ac3c38]: Update mobileapps to c3871cc (duration: 03m 48s) [21:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:23] (03CR) 10Ottomata: [C: 03+2] Install maven and ivysettings on all hadoop workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) (owner: 10Ottomata) [21:10:29] !log arlolra@deploy1001 Started deploy [parsoid/deploy@cb62482]: Updating Parsoid to a8fe45e [21:10:29] (03PS2) 10Ottomata: Install maven and ivysettings on all hadoop workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) [21:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:33] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Install maven and ivysettings on all hadoop workers and clients [puppet] - 10https://gerrit.wikimedia.org/r/492361 (https://phabricator.wikimedia.org/T216093) (owner: 10Ottomata) [21:11:03] !log turning down elasticsearch service on logstash100[456] (data has been migrated to logstash101[012]) T213898 [21:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:06] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - 
https://phabricator.wikimedia.org/T213898 [21:11:19] tgr: I would prefer to do it in the morning (specially with that huge amount of edits on commons) [21:12:07] tgr: EU morning I mean :) [21:13:56] well, EU morning is midnight here [21:14:48] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@cb62482]: Updating Parsoid to a8fe45e (duration: 04m 19s) [21:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:07] can you run the script? I'll probably be around but don't want to mess with production at that time [21:15:54] (03CR) 10Ottomata: [C: 03+1] hadoop: allow yarn rmstore to be stored on HDFS [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492697 (https://phabricator.wikimedia.org/T216952) (owner: 10Elukey) [21:16:04] not that the script does much at this point, it should just schedule a bunch of jobs [21:18:27] marostegui: alternatively, I could run it in the morning here, that's UTC 17 [21:22:09] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 504 (expecting: 200) [21:23:06] (03PS1) 10Herron: logstash: add logstash101[012] to unicast hosts [puppet] - 10https://gerrit.wikimedia.org/r/492769 (https://phabricator.wikimedia.org/T213898) [21:23:13] (03PS1) 10Ppchelko: Remove legacy eventBus config settings. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 [21:23:23] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [21:23:56] (03CR) 10Ppchelko: [C: 04-1] "Need to wait until this week train is deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [21:25:31] (03PS9) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [21:28:37] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14843/" [puppet] - 10https://gerrit.wikimedia.org/r/492769 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [21:28:44] (03PS2) 10Herron: logstash: add logstash101[012] to unicast hosts [puppet] - 10https://gerrit.wikimedia.org/r/492769 (https://phabricator.wikimedia.org/T213898) [21:30:45] 10Operations, 10Release Pipeline, 10Core Platform Team Backlog (Watching / External), 10Release-Engineering-Team (Watching / External), 10Services (watching): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10akosiaris) Oops, sorry f... [21:30:52] (03CR) 10Herron: [C: 03+2] logstash: add logstash101[012] to unicast hosts [puppet] - 10https://gerrit.wikimedia.org/r/492769 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [21:43:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: remove shell access for mkroetzsch on 2019-01-26 - https://phabricator.wikimedia.org/T214498 (10Smalyshev) [21:44:43] (03PS10) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [21:46:12] (03CR) 10jerkins-bot: [V: 04-1] Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [21:48:42] (03PS11) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [21:51:45] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) [21:58:21] i'm investigating the mobileapps flapping. [22:00:05] bawolff and Reedy: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190225T2200). [22:00:21] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [22:11:10] 10Operations, 10Traffic, 10VisualEditor, 10Wikimedia-Apache-configuration: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the