[00:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T0000).
[00:00:05] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) I've also made a counter to check how many "forward skips" - i.e. loading revision further than we've...
[00:23:02] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[00:24:18] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.408 second response time https://phabricator.wikimedia.org/T174916
[00:28:18] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[00:30:48] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 5.958 second response time https://phabricator.wikimedia.org/T174916
[00:34:46] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[00:40:33] !log restart pdfrender on scb1003 - T174916
[00:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:36] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[00:41:04] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time https://phabricator.wikimedia.org/T174916
[00:43:34] (PS1) BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654)
[00:44:56] (CR) jerkins-bot: [V: -1] wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] -
https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[00:47:20] (PS2) BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654)
[00:59:16] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:01:51] (PS3) BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654)
[01:19:43] (PS1) BryanDavis: wikitech: Provision phabricator api token [puppet] - https://gerrit.wikimedia.org/r/501125 (https://phabricator.wikimedia.org/T218654)
[01:20:47] (PS4) BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654)
[01:22:20] (CR) BryanDavis: "Related Puppet change in I5d9b339d7093377995df4e92099f795ac5c80890" [mediawiki-config] - https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[01:26:12] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[01:46:19] (PS19) Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - https://gerrit.wikimedia.org/r/397723
[01:55:44] (CR) Ayounsi: "Tested in a Cloud instance, without actually establishing BGP to a router." [puppet] - https://gerrit.wikimedia.org/r/397723 (owner: Ayounsi)
[02:01:04] RECOVERY - Check systemd state on elastic2048 is OK: OK - running: The system is fully operational
[02:29:20] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:33:54] Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Aklapper) > In particular, we were told some people thought of making the ticket...
[02:37:24] Operations, ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (RobH) Please note this is an in warranty system, and thus doesn't use onsite spares. @papaul will need to open a dell dispatch for a replacement part.
[02:46:04] (PS2) Andrew Bogott: utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430)
[02:46:31] (CR) jerkins-bot: [V: -1] utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[02:49:16] (PS3) Andrew Bogott: utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430)
[02:49:43] (CR) jerkins-bot: [V: -1] utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[02:56:28] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[03:18:24] (PS4) Andrew Bogott: utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430)
[03:18:49] (CR) jerkins-bot: [V: -1] utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott)
[03:29:30] (PS1) Andrew Bogott:
flake8 fixes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501130
[03:30:26] (PS5) Andrew Bogott: utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430)
[03:49:53] (PS6) Andrew Bogott: utils.facts_file: support multiple facts dirs [software/puppet-compiler] - https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430)
[03:59:38] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:04:00] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:23:25] Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (crusnov) p:Triage→Normal
[04:26:34] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[04:35:06] PROBLEM - puppet last run on wtp1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:35:36] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:36:06] Operations, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 3 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Mh-3110) a:Ottomata
[04:38:41] Operations, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Kanban (Doing), and 3 others: Enabling api-request eventgate to group1 caused minor service disruptions - https://phabricator.wikimedia.org/T218255 (Mh-3110) Hi @Ottomata, assi...
[04:54:15] Operations, ops-codfw, DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (Marostegui) Open→Resolved All good! Thanks! ` root@db2070:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FADD0) Port Name: 1I Port Name:...
[04:58:49] !log Deploy schema change on labswiki for the job table - T219887
[04:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:53] T219887: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887
[05:01:32] RECOVERY - puppet last run on wtp1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:06:38] (PS1) Marostegui: db2033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/501134 (https://phabricator.wikimedia.org/T219493)
[05:07:50] (CR) Marostegui: [C: +2] db2033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/501134 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:13:10] (CR) Marostegui: [C: +1] mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[05:14:31] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/501135 (https://phabricator.wikimedia.org/T219493)
[05:15:50] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Remove db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/501135 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:16:57] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/501135 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:18:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2033 for decommission T219493 (duration: 00m
59s)
[05:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:30] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[05:19:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2033 for decommission T219493 (duration: 00m 59s)
[05:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:23] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db2033 [mediawiki-config] - https://gerrit.wikimedia.org/r/501135 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:31:33] (PS1) Marostegui: mariadb: db2033 set to spare [puppet] - https://gerrit.wikimedia.org/r/501136 (https://phabricator.wikimedia.org/T219493)
[05:32:05] !log Remove db2033 from tendril and zarcillo - T219493
[05:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:08] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[05:34:15] (CR) Marostegui: [C: +2] mariadb: db2033 set to spare [puppet] - https://gerrit.wikimedia.org/r/501136 (https://phabricator.wikimedia.org/T219493) (owner: Marostegui)
[05:39:38] !log Stop MySQL on db2033 for decommission - T219493
[05:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:39:42] T219493: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493
[05:44:32] Operations, ops-codfw: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (Marostegui)
[05:46:04] Operations, ops-codfw: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (Marostegui) p:Triage→Normal
[05:46:36] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:12:15] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:17:05] (PS1) Marostegui: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763)
[06:17:20] (CR) Marostegui: [C: -1] "Wait for Monday" [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui)
[06:30:22] (CR) Jcrespo: [C: +1] db-eqiad.php: Depool all x1 slaves [mediawiki-config] - https://gerrit.wikimedia.org/r/501147 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui)
[06:38:08] Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (Dzahn) Earlier i had re-enabled monitoring and started the service. I think it's known to @ssastry that it works for a while and then failed before.
[06:52:13] (Abandoned) Mathew.onipe: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:02:11] Operations, ops-codfw: Degraded RAID on db2033 - https://phabricator.wikimedia.org/T220074 (ops-monitoring-bot)
[07:06:29] (CR) Vgutierrez: [C: +2] sslcert: update-ocsp: Fix passing Host header in absence of proxy [puppet] - https://gerrit.wikimedia.org/r/500398 (owner: Alex Monk)
[07:06:43] (PS3) Vgutierrez: sslcert: update-ocsp: Fix passing Host header in absence of proxy [puppet] - https://gerrit.wikimedia.org/r/500398 (owner: Alex Monk)
[07:09:24] Operations, ops-codfw: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (Marostegui)
[07:23:28] (CR) Vgutierrez: [C: +2] profile::cache::ssl::wikibase: Simplify [puppet] - https://gerrit.wikimedia.org/r/500973 (owner: Alex Monk)
[07:23:36] (PS4) Vgutierrez: profile::cache::ssl::wikibase: Simplify [puppet] - https://gerrit.wikimedia.org/r/500973 (owner: Alex Monk)
[07:35:00] (Restored) Mathew.onipe: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:35:42] (CR) Vgutierrez: [C: +1] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:41:54] (CR) Vgutierrez: [C: +2] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[07:42:08] (PS3) Vgutierrez: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/500940 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[08:06:01] Operations, Analytics, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) @EBernhardson on stat1005 `mivisionx` was causing a broken apt...
[08:06:15] !log uploaded Apache 2.4.10-10+deb8u14+wmf1 to apt.wikimedia.org/jessie-wikimedia (latest jessie security update rebased with our local patches)
[08:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:07] Operations, ops-eqiad, DBA: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) We have agreed that we want to aim for **10th April** to avoid the risk of the master going down unexpectedly during the upcoming Easter holidays where there will be less c...
[08:09:52] (PS2) Muehlenhoff: Add qemu processes/Ganeti instances to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991)
[08:10:41] (CR) Vgutierrez: "hmm it shouldn't be better to move certs parameter population to hiera and set it as empty when acme-chief are intended to serve user traf" [puppet] - https://gerrit.wikimedia.org/r/500631 (https://phabricator.wikimedia.org/T182927) (owner: Alex Monk)
[08:11:02] (CR) Muehlenhoff: [C: +2] Add qemu processes/Ganeti instances to filter_services list of debdeploy [puppet] - https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:11:56] (PS1) Elukey: admin: add gpu-users group and assign it to stat1005 [puppet] - https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843)
[08:25:20] (PS1) Gehel: elasticsearch: use NodesGroup instead of free form json [software/spicerack] - https://gerrit.wikimedia.org/r/501157
[08:29:41] PROBLEM - Check systemd state on scandium is
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:30:26] (CR) jerkins-bot: [V: -1] elasticsearch: use NodesGroup instead of free form json [software/spicerack] - https://gerrit.wikimedia.org/r/501157 (owner: Gehel)
[08:36:08] !log rolling restart of parsoid to pick up OpenSSL security update
[08:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:18] (PS2) Gehel: elasticsearch: use NodesGroup instead of free form json [software/spicerack] - https://gerrit.wikimedia.org/r/501157
[08:46:35] (CR) Filippo Giunchedi: [C: +1] "LGTM, let's do it!" [puppet] - https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: Mobrovac)
[08:51:09] Operations, ops-codfw, decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (Peachey88)
[08:56:45] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[08:58:51] Operations, Analytics, netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (fgiunchedi)
[09:05:49] (PS1) Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921)
[09:08:52] Operations, Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (jbond) we will also need to configure http proxy for the security updates
[09:10:57] (CR) Filippo Giunchedi: [C: +1] RAID: hpssacli exit with correct code [puppet] - https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854) (owner: Volans)
[09:20:59] (CR) Filippo Giunchedi: [C: +1] "LGTM, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: Cwhite)
[09:27:48] (CR) Filippo Giunchedi: [C: +1] "LGTM,
see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/495980 (https://phabricator.wikimedia.org/T213899) (owner: Herron)
[09:33:37] (CR) Volans: [C: +1] "Code looks ok to me, I'm not too familiar with ES to fully validate the test's logic though." [software/spicerack] - https://gerrit.wikimedia.org/r/501157 (owner: Gehel)
[09:35:02] (PS6) Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - https://gerrit.wikimedia.org/r/500406
[09:36:04] (PS2) Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921)
[09:43:49] Operations, Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (akosiaris) > Currently the /etc/apt/sources.list for the pbuilder base images are missing entries for the security suites. Theses files should be updated and managed by puppet. Wh...
[09:46:09] (PS3) Volans: tests: mark test strings with escapes as raw [software/debmonitor] - https://gerrit.wikimedia.org/r/483131
[09:46:11] (PS5) Volans: Kernels refactor [software/debmonitor] - https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592)
[09:46:13] (PS5) Volans: DataTables: refactor column grouping [software/debmonitor] - https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592)
[09:46:15] (PS5) Volans: Add search capability [software/debmonitor] - https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592)
[09:47:51] Operations, Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (MoritzMuehlenhoff) >>! In T220003#5084382, @akosiaris wrote: >> Currently the /etc/apt/sources.list for the pbuilder base images are missing entries for the security suites. Theses...
[09:48:28] (PS1) Ema: ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263)
[09:49:11] (CR) jerkins-bot: [V: -1] ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:49:14] (PS1) Volans: sre.hosts.downtime: fix typo in argument name [cookbooks] - https://gerrit.wikimedia.org/r/501161
[09:49:30] (CR) Volans: "addressed comment" (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/483131 (owner: Volans)
[09:50:31] (PS1) Elukey: cumin: add hadoop-hdfs-backup aliases [puppet] - https://gerrit.wikimedia.org/r/501162 (https://phabricator.wikimedia.org/T218343)
[09:50:55] (CR) Elukey: [C: +1] sre.hosts.downtime: fix typo in argument name [cookbooks] - https://gerrit.wikimedia.org/r/501161 (owner: Volans)
[09:51:32] (CR) Volans: [C: +2] sre.hosts.downtime: fix typo in argument name [cookbooks] - https://gerrit.wikimedia.org/r/501161 (owner: Volans)
[09:52:08] (PS1) Jbond: pbuilder: add security updates repository [puppet] - https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T219803)
[09:52:30] Operations, Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (akosiaris) >>! In T220003#5084386, @MoritzMuehlenhoff wrote: >>>! In T220003#5084382, @akosiaris wrote: >>> Currently the /etc/apt/sources.list for the pbuilder base images are mis...
[09:52:46] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[09:53:04] (Merged) jenkins-bot: sre.hosts.downtime: fix typo in argument name [cookbooks] - https://gerrit.wikimedia.org/r/501161 (owner: Volans)
[09:55:39] Operations, monitoring, Proposal: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158 (fgiunchedi) >>! In T126158#5073889, @jcrespo wrote: > I was convinced, this is desirable, but I don't see a way to move forward th...
[09:56:40] (PS2) Ema: ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263)
[09:57:29] (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/501162 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[09:59:08] (PS2) Volans: icinga: sync only if config is valid and log it [puppet] - https://gerrit.wikimedia.org/r/501083
[09:59:37] PROBLEM - Apache HTTP on mw1313 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:00:25] (CR) Volans: [C: +2] icinga: sync only if config is valid and log it [puppet] - https://gerrit.wikimedia.org/r/501083 (owner: Volans)
[10:00:55] RECOVERY - Apache HTTP on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:01:46] !log installing openssl1.0 security updates on stretch-based DB hosts
[10:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:19] (PS2) Arturo Borrero Gonzalez: wikitech: Provision phabricator api token [puppet] - https://gerrit.wikimedia.org/r/501125 (https://phabricator.wikimedia.org/T218654)
(owner: BryanDavis)
[10:02:29] (CR) Elukey: [C: +2] cumin: add hadoop-hdfs-backup aliases [puppet] - https://gerrit.wikimedia.org/r/501162 (https://phabricator.wikimedia.org/T218343) (owner: Elukey)
[10:02:36] (PS2) Elukey: cumin: add hadoop-hdfs-backup aliases [puppet] - https://gerrit.wikimedia.org/r/501162 (https://phabricator.wikimedia.org/T218343)
[10:02:54] Operations, monitoring, Proposal: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158 (jcrespo) Yes, and we would want to do that- but on a dashboard level (outside of the scope), not hard alert level. For example, t...
[10:03:48] (CR) Arturo Borrero Gonzalez: [C: +2] wikitech: Provision phabricator api token [puppet] - https://gerrit.wikimedia.org/r/501125 (https://phabricator.wikimedia.org/T218654) (owner: BryanDavis)
[10:04:40] (PS3) Elukey: cumin: add hadoop-hdfs-backup aliases [puppet] - https://gerrit.wikimedia.org/r/501162 (https://phabricator.wikimedia.org/T218343)
[10:11:12] (CR) Arturo Borrero Gonzalez: "How was this working in the main deployment?"
[puppet] - https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: BryanDavis)
[10:11:52] (PS4) Arturo Borrero Gonzalez: openstack: serverpackages: require apt-get update before moving on [puppet] - https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981)
[10:14:21] (PS5) Arturo Borrero Gonzalez: openstack: serverpackages: require apt-get update before moving on [puppet] - https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981)
[10:14:34] (CR) Vgutierrez: "looks good, but I'm wondering if ocsp_proxy would be a better/clearer name" [puppet] - https://gerrit.wikimedia.org/r/500406 (owner: Alex Monk)
[10:16:04] Operations, Packaging: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (MoritzMuehlenhoff) >>! In T220003#5084401, @akosiaris wrote: > Ah, so `jessie-security` is partly behaving like `backports` in a sense. OK, so my assumption wasn't entirely correct...
[10:16:06] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: serverpackages: require apt-get update before moving on [puppet] - https://gerrit.wikimedia.org/r/500977 (https://phabricator.wikimedia.org/T219981) (owner: Arturo Borrero Gonzalez)
[10:16:16] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (fgiunchedi) I looked into this as well and the rsyslog 8.38 prerm doesn't have the extra `[ "$1" = remove ]` test guarding invoke-rc.d, which 8.1901 does have instead, hence why rsyslog stop...
[10:16:30] gehel: i see you are logged in on bast2001, heads up it is going to go down, replaced by bast2002
[10:17:25] (PS3) Ema: ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263)
[10:17:27] (PS1) Ema: ATS: custom WMF error page [puppet] - https://gerrit.wikimedia.org/r/501168 (https://phabricator.wikimedia.org/T213263)
[10:17:44] (Abandoned) Filippo Giunchedi: prometheus: introduce query/connection limits parameters [puppet] - https://gerrit.wikimedia.org/r/494685 (https://phabricator.wikimedia.org/T217715) (owner: Filippo Giunchedi)
[10:18:35] Operations, monitoring, Patch-For-Review, Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (fgiunchedi) >>! In T217715#5078801, @CDanis wrote: > Filippo, did you decide r494685 wasn't necessary? I...
[10:20:22] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (MoritzMuehlenhoff) >>! In T219764#5084421, @fgiunchedi wrote: > I looked into this as well and the rsyslog 8.38 prerm doesn't have the extra `[ "$1" = remove ]` test guarding invoke-rc.d, wh...
[10:21:27] !log T219626 reimaging cloudcontrol2001-dev again
[10:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:31] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626
[10:21:51] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[10:22:25] (PS3) Volans: RAID: hpssacli exit with correct code [puppet] - https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854)
[10:25:41] (CR) Volans: [C: +2] RAID: hpssacli exit with correct code [puppet] - https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854) (owner: Volans)
[10:26:33] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (fgiunchedi) >>! In T219764#5084446, @MoritzMuehlenhoff wrote: >>>! In T219764#5084421, @fgiunchedi wrote: >> I looked into this as well and the rsyslog 8.38 prerm doesn't have the extra `[ "...
[10:27:15] Operations, ops-codfw, Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (Volans) @fgiunchedi what are your thoughts on T219854#5076968 ? That's the last remaining part of this task I guess.
[10:27:22] (PS4) Ema: ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263)
[10:28:22] mutante: thanks for the heads up. At lunch, but kill the session if needed
[10:28:28] (PS1) Elukey: Revert "admin: temporary remove piccardi from analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/501170
[10:28:43] (PS2) Elukey: Revert "admin: temporary remove piccardi from analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/501170
[10:29:06] gehel: no need to kill anything..
realized i am still blocked on router ACL to allow it to talk to mgmt hosts as well
[10:30:16] (CR) Elukey: [C: +2] Revert "admin: temporary remove piccardi from analytics-privatedata-users" [puppet] - https://gerrit.wikimedia.org/r/501170 (owner: Elukey)
[10:31:13] Operations, monitoring, Proposal: [RFC] Alert about *when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158 (fgiunchedi) >>! In T126158#5084407, @jcrespo wrote: > Yes, and we would want to do that- but on a dashboard level (outside of the...
[10:32:00] (PS1) Arturo Borrero Gonzalez: hieradata: openstack: introduce placeholder for the new pab API token [labs/private] - https://gerrit.wikimedia.org/r/501173
[10:32:23] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] hieradata: openstack: introduce placeholder for the new pab API token [labs/private] - https://gerrit.wikimedia.org/r/501173 (owner: Arturo Borrero Gonzalez)
[10:32:43] (CR) Vgutierrez: [C: +1] ATS: add support for custom error messages [puppet] - https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[10:32:43] PROBLEM - Disk space on dbprov2001 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied
[10:33:08] Operations, ops-codfw, Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (fgiunchedi) >>! In T219854#5076968, @Volans wrote: > So the `dsa-check-hpssacli` check is happily returning `0` exit code and this output: > ` > OK: Slot 0: no logical drives --- Slot 0: no...
[10:36:42] (PS1) Mathew.onipe: acme_chief: generate cert for each cirrus clusters [puppet] - https://gerrit.wikimedia.org/r/501174 (https://phabricator.wikimedia.org/T214921)
[10:41:01] Operations, SRE-Access-Requests: Requesting access to analytics machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (Lucas_Werkmeister_WMDE)
[10:44:05] Operations, SRE-Access-Requests: Requesting access to analytics machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (RazShuty) As the Engineering Manager of @Lucas_Werkmeister_WMDE I totally approve this request on my side and think it's super important for us a team to not lose th...
[10:46:25] Operations, docker-pkg, serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (Joe)
[10:46:33] Operations, docker-pkg, serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (Joe) p:Triage→Normal
[10:46:49] Operations, Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (fgiunchedi) >>! In T219764#5073355, @Krenair wrote: > This was on deployment-sca02 but the list of deployment-prep instances failing puppet grew suddenly, so I expect a few of these were aff...
[10:47:17] 10Operations, 10docker-pkg, 10serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (10Joe) [10:49:43] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: fix path of wikitech phab API token [labs/private] - 10https://gerrit.wikimedia.org/r/501175 [10:50:23] (03PS3) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [10:52:40] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:53:32] PROBLEM - puppet last run on labweb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:53:33] (03CR) 10Vgutierrez: [C: 04-2] acme_chief: generate cert for each cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/501174 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [10:56:32] !log uploaded hhvm-wikidiff 1.8.1 to apt.wikimedia.org/stretch-wikimedia (source package is named php-wikdiff2 for legacy reasons) (T203069) [10:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:36] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [10:57:46] (03CR) 10Volans: [C: 04-1] "One missing thing, looks good otherwise." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501104 (owner: 10CRusnov) [10:58:38] RECOVERY - puppet last run on labweb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:54] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:59:58] (03PS5) 10Ema: ATS: add support for custom error messages [puppet] - 10https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1100). [11:00:04] CFisch_WMDE and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] o/ [11:00:22] \o/ [11:00:24] Mine is not testable [11:00:29] (03PS2) 10Ema: ATS: custom WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/501168 (https://phabricator.wikimedia.org/T213263) [11:02:48] Mine is testable :-) [11:03:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, now this matches the list of domains I use in my QuickCategories tool (which I based on the domains in the main Wikimedia TLS certif" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500976 (owner: 10Ladsgroup) [11:06:27] (03PS2) 10WMDE-Fisch: Enable ReferencePreviews beta feature on de- and ar-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498371 (https://phabricator.wikimedia.org/T218766) [11:07:11] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: drop python3-psutil from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/501177 [11:08:09] !log drop python-psutil from jessie-wikimedia/openstack-mitaka-jessie, related to T219626 [11:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:13] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [11:08:24] Anyone doing the SWAT? 
^^' [11:08:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop python3-psutil from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/501177 (owner: 10Arturo Borrero Gonzalez) [11:08:38] (03PS1) 10Jbond: debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 [11:09:29] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:10:12] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:10:32] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:11:32] I can do it if no one is around [11:11:52] kewl [11:12:23] (03PS2) 10Jbond: pbuilder: add security updates repository [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T219803) [11:14:36] (03PS2) 10Jbond: debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 [11:15:36] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:17:12] Amir1: Please go for it if that's fine. [11:17:20] 10Operations, 10docker-pkg, 10serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (10Joe) I think I found the culprit: when requesting directly to the registry, I do see the correct manifest is returned *and* a content-length header is... 
[11:17:20] sure [11:17:26] 10Operations, 10Traffic, 10docker-pkg, 10serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (10Joe) [11:17:54] (03PS6) 10Ema: ATS: add support for custom error messages [puppet] - 10https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) [11:18:02] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498371 (https://phabricator.wikimedia.org/T218766) (owner: 10WMDE-Fisch) [11:19:08] (03Merged) 10jenkins-bot: Enable ReferencePreviews beta feature on de- and ar-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498371 (https://phabricator.wikimedia.org/T218766) (owner: 10WMDE-Fisch) [11:19:38] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:19:47] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:19:53] (03CR) 10Vgutierrez: [C: 04-1] "Be careful, as pcc shows: https://puppet-compiler.wmflabs.org/compiler1002/15563/" [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [11:19:55] (03CR) 10Volans: [C: 04-1] "Looks good, few minor things inline. 
Given that the change in the API side needs coordination to deploy in any case I think we can bundle " (036 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [11:20:23] thanks Amir1, I was in a meeting and forgot about swat [11:21:05] (03CR) 10jerkins-bot: [V: 04-1] debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:21:14] (03PS3) 10Jbond: debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 [11:21:18] !log T219626 reimaging cloudcontrol2001-dev again [11:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:26] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [11:21:33] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: drop gdb from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/501180 (https://phabricator.wikimedia.org/T219626) [11:21:50] CFisch_WMDE: it's live in mwdebug1002 [11:22:07] 10Operations, 10Traffic, 10docker-pkg, 10serviceops: Getting registry metadata from a public client fails on our registry - https://phabricator.wikimedia.org/T220085 (10Joe) p:05Normal→03Low To be clear, here "docker client" means "the docker daemon running on some computer", as any docker client libra... 
[11:22:15] * CFisch_WMDE testing [11:22:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop gdb from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/501180 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:23:03] Amir1: Works like a charm, go for it plz :-) [11:23:08] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [11:24:48] okay, going live [11:25:41] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:498371|Enable ReferencePreviews beta feature on de- and ar-wiki (T218766)]] (duration: 01m 00s) [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:45] T218766: Prepare deployment of ReferencePreviews beta feature on de- and ar-wiki - https://phabricator.wikimedia.org/T218766 [11:25:52] Done, CFisch_WMDE take a look please [11:26:18] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) contacted Fussi at https://de.wikivoyage.org/wiki/Benutzer_Diskussion:DerFussi#Statu... [11:26:28] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [11:27:34] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/501181 (https://phabricator.wikimedia.org/T219626) [11:27:35] Amir1: Works fine ty! 
[11:27:57] ACKNOWLEDGEMENT - Disk space on dbprov2001 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied Jcrespo Pending new server deployment fix [11:28:01] !log rolling security updates for apache on jessie [11:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: add IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/501181 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:28:34] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500976 (owner: 10Ladsgroup) [11:29:19] (03PS7) 10Ema: ATS: add support for custom error messages [puppet] - 10https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) [11:29:22] ACKNOWLEDGEMENT - Disk space on dbprov2002 is CRITICAL: DISK CRITICAL - /srv/backups/dumps/ongoing is not accessible: Permission denied Jcrespo pending new server deployment fix [11:29:33] (03Merged) 10jenkins-bot: Add mediawiki.org to the URL shortener whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500976 (owner: 10Ladsgroup) [11:29:37] (03PS1) 10Giuseppe Lavagetto: Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 [11:29:39] (03PS1) 10Giuseppe Lavagetto: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 [11:29:41] (03PS1) 10Giuseppe Lavagetto: Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 [11:29:45] (03PS1) 10Giuseppe Lavagetto: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 [11:30:01] (03CR) 10jenkins-bot: Enable ReferencePreviews beta feature on de- and ar-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498371 (https://phabricator.wikimedia.org/T218766) (owner: 10WMDE-Fisch) [11:30:25] (03CR) 
10jenkins-bot: Add mediawiki.org to the URL shortener whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500976 (owner: 10Ladsgroup) [11:30:38] (03PS1) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) [11:30:57] (03CR) 10jerkins-bot: [V: 04-1] Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto) [11:30:59] (03CR) 10jerkins-bot: [V: 04-1] Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto) [11:31:17] (03CR) 10jerkins-bot: [V: 04-1] Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 (owner: 10Giuseppe Lavagetto) [11:31:20] (03CR) 10jerkins-bot: [V: 04-1] Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 (owner: 10Giuseppe Lavagetto) [11:31:48] (03CR) 10Ema: [C: 03+2] ATS: add support for custom error messages [puppet] - 10https://gerrit.wikimedia.org/r/501160 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [11:32:05] (03PS3) 10Ema: ATS: custom WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/501168 (https://phabricator.wikimedia.org/T213263) [11:32:08] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:500976|Add mediawiki.org to the URL shortener whitelist]] (duration: 00m 58s) [11:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:25] (03CR) 10Ema: [C: 03+2] ATS: custom WMF error page [puppet] - 10https://gerrit.wikimedia.org/r/501168 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [11:34:15] (03PS3) 10Giuseppe Lavagetto: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 [11:34:18] (03PS2) 10Giuseppe Lavagetto: Fix the nightly 
build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 [11:34:20] (03PS2) 10Giuseppe Lavagetto: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 [11:34:22] (03PS2) 10Giuseppe Lavagetto: Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 [11:34:24] (03PS2) 10Giuseppe Lavagetto: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 [11:34:53] !log EU SWAT is done [11:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:58] !log uploaded nodejs 10.15.2~dfsg-1+wmf1 to the component/node10 component of apt.wikimedia.org/stretch-wikimedia (updated to latest 10.x release and a change to ensure zlib binary compat with NodeSource) (T215562) [11:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:01] T215562: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 [11:36:05] (03CR) 10Mathew.onipe: "PCC is noop: https://puppet-compiler.wmflabs.org/compiler1002/15564/" [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:40:38] 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) @Vgutierrez So it turns out back in 2016 this was discussed with WMUK and they agreed on parking / deactivating it (T128085#2065197). This is not getting any traffic since the park... 
[11:41:00] 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) fwiw Chuck replied and said this is already not controlled by WMF anymore and that's why it's not in MarkMonitor [11:41:14] (03CR) 10Jbond: admin: add gpu-users group and assign it to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [11:41:51] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Krinkle >>! In T215562#5082327, @K... [11:43:23] !log upgrading HHVM on mwdebug servers in eqiad along with update to hhvm-wikidiff 1.8.1 [11:43:24] (03CR) 10Vgutierrez: [C: 04-1] "you don't need to change acme_chief::cert at all" [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [11:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:53] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [11:45:13] effie: ^^ that's you I assume [11:45:24] no [11:45:35] I havent done anything yet [11:46:40] (03CR) 10Elukey: admin: add gpu-users group and assign it to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [11:46:44] it's already depooled I'd assume [11:46:51] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [11:47:10] so pybal is complaining cause it won't get ipvs refreshed till it's restarted [11:47:31] PROBLEM - Check systemd state on cloudnet2003-dev is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:48:07] RECOVERY - puppet last run on cloudcontrol2001-dev is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:49:39] (03PS1) 10Elukey: role::statistics::gpu: remove stat1005 specific config [puppet] - 10https://gerrit.wikimedia.org/r/501192 [11:50:19] (03CR) 10Dzahn: admin: add gpu-users group and assign it to stat1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [11:51:50] (03CR) 10Elukey: [C: 03+2] role::statistics::gpu: remove stat1005 specific config [puppet] - 10https://gerrit.wikimedia.org/r/501192 (owner: 10Elukey) [11:52:02] (03CR) 10Dzahn: [C: 03+1] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/501192 (owner: 10Elukey) [11:52:41] mutante,jbond42 - it was also wrong, since I didn't realize that my team members were not allowed to ssh :P [11:52:48] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/501193 (https://phabricator.wikimedia.org/T219626) [11:52:51] thanks for the suggestions [11:52:56] elukey: :) [11:54:59] (03PS3) 10Jbond: pbuilder: add security updates repository [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) [11:55:08] (03PS2) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) [11:55:45] (03PS2) 10Elukey: admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) [11:56:09] done! 
[11:58:19] PROBLEM - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,1 instance=db2044:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops [11:59:42] (03CR) 10Jcrespo: "It is unclear to me, based on the tickets, what are the consequences of this- this may require additional filtering or unfiltering from pr" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1200) [12:00:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/501193 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [12:00:35] RECOVERY - Check systemd state on cloudnet2003-dev is OK: OK - running: The system is fully operational [12:01:18] (03PS1) 10Ema: ATS: fix template_sets_dir [puppet] - 10https://gerrit.wikimedia.org/r/501195 (https://phabricator.wikimedia.org/T213263) [12:01:49] (03CR) 10Mathew.onipe: "PCC is noop with (including cp* nodes): https://puppet-compiler.wmflabs.org/compiler1002/15567/" [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:02:35] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:02:40] (03PS4) 10Alexandros Kosiaris: Remove citoid role/profile [puppet] - 10https://gerrit.wikimedia.org/r/494215 (https://phabricator.wikimedia.org/T213194) [12:02:47] (03CR) 10Ema: [C: 03+2] ATS: fix template_sets_dir [puppet] - 
10https://gerrit.wikimedia.org/r/501195 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [12:03:41] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:43] RECOVERY - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is OK: OK ferm input default policy is set [12:10:37] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:10:47] !log T219626 reimaging cloudcontrol2001-dev again [12:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:50] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [12:14:49] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:16:46] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10Yann) Hi, Sorry, but where to discuss this if not here? I agree that many users... [12:17:08] (03PS3) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) [12:18:11] (03CR) 10Muehlenhoff: [C: 03+1] "Nice, looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [12:18:20] (03CR) 10Volans: [C: 03+2] tests: mark test strings with escapes as raw [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [12:19:29] (03CR) 10Vgutierrez: [C: 04-1] "almost :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:20:11] (03Merged) 10jenkins-bot: tests: mark test strings with escapes as raw [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [12:21:37] (03PS4) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) [12:22:08] (03CR) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:22:14] 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Vgutierrez) Then IMHO we should get rid of it at operations/dns and in redirects.dat in operations/puppet, what are your thoughts @BBlack? 
[12:23:40] (03CR) 10Alex Monk: [C: 04-1] " or stuff like https://gerrit.wikimedia.org/r/c/operations/puppet/+/501174 happens" [puppet] - 10https://gerrit.wikimedia.org/r/501174 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:24:18] (03CR) 10Vgutierrez: tlsproxy::localssl: split title and acme cert name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:24:20] (03CR) 10Meshvogel: "> It is unclear to me, based on the tickets, what are the" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [12:25:22] jouncebot: now [12:25:22] For the next 0 hour(s) and 34 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1200) [12:26:33] (03CR) 10Alex Monk: "This looks good, I agree with Valentin's comment about the doc wording" [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:26:37] (03PS1) 10Ema: ATS: add request/response details to error message [puppet] - 10https://gerrit.wikimedia.org/r/501198 (https://phabricator.wikimedia.org/T213263) [12:31:40] (03PS5) 10Effie Mouzeli: Remove citoid role/profile [puppet] - 10https://gerrit.wikimedia.org/r/494215 (https://phabricator.wikimedia.org/T213194) (owner: 10Alexandros Kosiaris) [12:32:11] 10Operations, 10Traffic, 10Patch-For-Review: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ema) Here is how our custom ATS errors look like. {F28578049} [12:33:22] (03CR) 10Ema: [C: 03+2] ATS: add request/response details to error message [puppet] - 10https://gerrit.wikimedia.org/r/501198 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [12:33:29] (03CR) 10Jbond: "i think you will need to admin::group entry from hieradata/hosts/stat1005.yaml. 
The lookup merge method so if it sees an entry here it w" [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [12:33:45] (03CR) 10Effie Mouzeli: [C: 03+2] Remove citoid role/profile [puppet] - 10https://gerrit.wikimedia.org/r/494215 (https://phabricator.wikimedia.org/T213194) (owner: 10Alexandros Kosiaris) [12:34:43] PROBLEM - Check systemd state on cloudnet2003-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:35:58] (03PS6) 10Effie Mouzeli: Remove citoid role/profile [puppet] - 10https://gerrit.wikimedia.org/r/494215 (https://phabricator.wikimedia.org/T213194) (owner: 10Alexandros Kosiaris) [12:36:02] (03CR) 10Jcrespo: "> But apparently it used to be" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [12:36:09] (03PS5) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) [12:36:52] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) 05Open→03Stalled stalled by T219384 [12:37:19] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) [12:37:37] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:38:55] (03PS7) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [12:39:22] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [12:39:33] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:40:13] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:40:19] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [12:40:25] (03CR) 10Jbond: [C: 03+2] debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 (owner: 10Jbond) [12:40:39] (03PS4) 10Jbond: debdeploy: change merge behaviour [puppet] - 10https://gerrit.wikimedia.org/r/501178 [12:40:47] ACKNOWLEDGEMENT - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T219933 [12:41:16] (03PS1) 10Dzahn: apache redirects: remove wicipediacymraeg.org [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) [12:41:32] 10Operations: parsoid-vd on scandium randomly died - https://phabricator.wikimedia.org/T219933 (10Dzahn) 08:39 <+icinga-wm> PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:41:39] (03PS8) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [12:44:21] (03CR) 10Mathew.onipe: tlsproxy::localssl: split title and acme cert name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:44:44] vgutierrez, Krenair ^ [12:45:13] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) fresh Icinga alert for this since about 23 hours [12:45:32] (03CR) 10Gehel: [C: 03+2] elasticsearch: use NodesGroup instead of free form json [software/spicerack] - 10https://gerrit.wikimedia.org/r/501157 (owner: 10Gehel) [12:45:39] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [12:45:54] /query mutante [12:45:58] damn [12:46:07] :) [12:46:24] onimisionipe: looking good, I'll merge it in a few minutes [12:46:24] (03CR) 10jenkins-bot: elasticsearch: use NodesGroup instead of free form json [software/spicerack] - 10https://gerrit.wikimedia.org/r/501157 (owner: 10Gehel) [12:46:43] vgutierrez: Thanks! [12:46:43] !log uploaded HHVM 3.18.5+dfsg-1+wmf8+deb9u2 to apt.wikimedia.org/stretch-wikimedia [12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:47] (03CR) 10Alex Monk: [C: 03+1] tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:48:55] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[apache2] [12:49:03] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10jijiki) @robH do we have an update? [12:51:46] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.22 [software/spicerack] - 10https://gerrit.wikimedia.org/r/501204 [12:51:48] (03Abandoned) 10Mathew.onipe: acme_chief: generate cert for each cirrus clusters [puppet] - 10https://gerrit.wikimedia.org/r/501174 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:53:11] (03CR) 10Vgutierrez: [C: 03+2] tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:53:24] (03PS6) 10Vgutierrez: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:53:29] (03PS9) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [12:53:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729 (owner: 10Giuseppe Lavagetto) [12:53:56] (03PS2) 10Giuseppe Lavagetto: uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729 [12:53:58] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) >>! In T218024#5049770, @bd808 wrote: >>>! In T218024#5045062, @aborrero wrote: >> * labtestwiki seems to be a... [12:56:13] <_joe_> waiting for jenkins... [12:56:48] me too... [12:56:49] :( [12:56:53] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [12:56:53] <_joe_> 3 minutes rule! 
[12:56:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729 (owner: 10Giuseppe Lavagetto) [12:57:10] wut? [12:57:21] that's how I get puppet snipped by _joe_ [12:57:24] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.22 [software/spicerack] - 10https://gerrit.wikimedia.org/r/501204 (owner: 10Volans) [12:57:35] <_joe_> ;P [12:58:20] (03PS7) 10Vgutierrez: tlsproxy::localssl: split title and acme cert name [puppet] - 10https://gerrit.wikimedia.org/r/501187 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:58:33] wait another 5 minutes [12:58:38] this is worse than xkcd.com/303 [12:59:54] <_joe_> vgutierrez: or you could fix jenkins [12:59:55] RECOVERY - Check systemd state on cloudnet2003-dev is OK: OK - running: The system is fully operational [13:00:00] <_joe_> wink wink [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1300) [13:02:05] PROBLEM - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [13:02:07] ACKNOWLEDGEMENT - HP RAID on db2044 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220102 [13:02:17] 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10ops-monitoring-bot) [13:02:28] Another one in codfw.. 
[13:02:37] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.22 [software/spicerack] - 10https://gerrit.wikimedia.org/r/501204 (owner: 10Volans) [13:03:03] (03PS2) 10Giuseppe Lavagetto: graphite: correctly set Cache-control: no-store [puppet] - 10https://gerrit.wikimedia.org/r/500730 [13:03:09] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) p:05Triage→03Normal a:03Papaul Can we get it replaced? Thanks! [13:03:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [13:03:12] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: logging infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220103 (10fgiunchedi) [13:04:19] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:04:28] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10fgiunchedi) [13:04:59] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (10fgiunchedi) [13:05:17] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [13:05:19] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10fgiunchedi) [13:05:21] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: Investigate distributed and long term storage solutions for Prometheus - https://phabricator.wikimedia.org/T213918 (10fgiunchedi) [13:05:32] <_joe_> vgutierrez: did your change merge? 
[13:05:45] yep [13:06:04] at least according to puppet-merge [13:06:27] 10Operations, 10monitoring, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [13:06:50] <_joe_> no I meant if you got a +2 from jenkins, sorry [13:06:52] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10fgiunchedi) [13:06:55] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [13:06:58] 10Operations, 10monitoring, 10Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) [13:07:00] (03PS1) 10Volans: Upstream release v0.0.22 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/501311 [13:07:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] graphite: correctly set Cache-control: no-store [puppet] - 10https://gerrit.wikimedia.org/r/500730 (owner: 10Giuseppe Lavagetto) [13:07:43] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10fgiunchedi) [13:07:44] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [13:09:31] _joe_: yes [13:10:04] _joe_: you can see it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/501187 [13:10:10] 1m 11secs later [13:10:57] <_joe_> that's the time the CI run took [13:11:05] <_joe_> the preceding one took 43 seconds [13:11:19] <_joe_> but it started after 4 minutes :) [13:12:15] so mine started after 1 minute [13:12:46] (03PS4) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 
10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [13:14:29] (03PS5) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [13:15:08] !log restart of phabricator apache service will occure at 14:25 [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:52] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.22 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/501311 (owner: 10Volans) [13:16:06] 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10Joe) p:05Triage→03Normal I did add the proper caching headers to graphite, so at least now we won't cache checks anymore at the edge. I still think we need to avoid going t... [13:16:27] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:09] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:11] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:11] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:35] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:45] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:17:51] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:17:53] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:18:04] (03PS1) 10Arturo Borrero Gonzalez: labtestservices2002: rename to cloudservices2002-dev and put into service [puppet] - 10https://gerrit.wikimedia.org/r/501314 (https://phabricator.wikimedia.org/T220101) [13:19:49] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [13:20:25] !log Stopped all citoid services from scb* - 494215 [13:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:29] (03PS1) 10Filippo Giunchedi: grafana: add objects by container type panel [puppet] - 10https://gerrit.wikimedia.org/r/501315 [13:20:31] (03Merged) 10jenkins-bot: Upstream release v0.0.22 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/501311 (owner: 10Volans) [13:20:47] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:21:07] (03CR) 10Filippo Giunchedi: [C: 03+2] grafana: add objects by container type panel [puppet] - 10https://gerrit.wikimedia.org/r/501315 (owner: 10Filippo Giunchedi) [13:21:45] (03PS6) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [13:23:34] !log upgrading mw1261 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 / wikidiff 1.8.1 [13:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:03] 10Operations, 10DBA, 10Gerrit, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) 05Stalled→03Resolved Closing this as resolved since this will be resolved with T200739 [13:25:06] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562 (10Paladox) [13:25:30] (03CR) 10Volans: "Doing some tests I've found an issue, see inline." 
(031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [13:26:21] (03PS10) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [13:27:09] !log uploaded spicerack_0.0.22-1_amd64.deb to apt.wikimedia.org stretch-wikimedia [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:17] 10Operations, 10monitoring, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10fgiunchedi) [13:27:38] (03CR) 10jerkins-bot: [V: 04-1] tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 (owner: 10Alex Monk) [13:27:41] 10Operations, 10monitoring, 10Patch-For-Review: google safe browsing icinga checks sporadic UNKNOWN due to 404 - https://phabricator.wikimedia.org/T216985 (10fgiunchedi) The lot has been returning 404 consistently now for a week, I've acked the alerts [13:28:32] !log upgraded spicerack to 0.0.22 on cumin[12]001 [13:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:35] gehel ^^^ [13:28:56] volans: \o/ Thanks! [13:29:11] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:29:11] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1002/15580/" [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:29:45] !log restart of gerrit apache service will occure at 13:40 [13:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:58] (03PS11) 10Alex Monk: tlsproxy::localssl: No hardcoding of prod webproxy hostname [puppet] - 10https://gerrit.wikimedia.org/r/500406 [13:32:19] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:34:45] (03PS2) 10Anomie: Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) [13:34:54] (03CR) 10Anomie: Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:34:58] (03CR) 10Anomie: [C: 03+2] Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:36:59] (03Merged) 10jenkins-bot: Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:40:25] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) p:05Triage→03Normal a:03Papaul [13:43:31] wikibugs went down? 
[13:43:33] (03PS1) 10Vgutierrez: localssl: Fix missing acme_certname in nginx template [puppet] - 10https://gerrit.wikimedia.org/r/501319 [13:43:38] uff that's some lag [13:43:59] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [13:44:23] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501320 [13:44:25] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501320 (owner: 10Ladsgroup) [13:44:43] I'm deploying this [13:44:57] (03CR) 10jenkins-bot: Set actor migration to read-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501000 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:46:17] (03CR) 10Gehel: [C: 03+1] "I'm not up to speed on the acme stuff, so not entirely sure if all is good for cloudelastic servers. The changes to elastic / relforge loo" [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:46:17] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 37s) [13:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:23] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501320 (owner: 10Ladsgroup) [13:46:25] (03CR) 10Alex Monk: [C: 03+1] localssl: Fix missing acme_certname in nginx template [puppet] - 10https://gerrit.wikimedia.org/r/501319 (owner: 10Vgutierrez) [13:46:39] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [13:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03PS2) 10Kosta Harlan: Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) [13:47:14] (03CR) 10Kosta Harlan: Enable ORES RCFilters for 
eswikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: 10Kosta Harlan) [13:51:32] (03CR) 10Vgutierrez: [C: 04-1] "do not merge it till https://gerrit.wikimedia.org/r/c/operations/puppet/+/501319 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:53:09] (03PS7) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [13:55:22] (03PS8) 10Vgutierrez: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [13:56:19] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-psi-https_9643: Servers elastic2043.codfw.wmnet, elastic2036.codfw.wmnet, elastic2032.codfw.wmnet, elastic2039.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2054.codfw.wmnet, elastic2035.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:56:29] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9643/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9643): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [13:56:58] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 695 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:56:59] Huh [13:57:18] looking [13:57:24] <_joe_> hey what's going on? 
[13:57:25] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - search-psi-https_9643: Servers elastic2043.codfw.wmnet, elastic2036.codfw.wmnet, elastic2049.codfw.wmnet, elastic2032.codfw.wmnet, elastic2033.codfw.wmnet, elastic2029.codfw.wmnet, elastic2053.codfw.wmnet, elastic2035.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:57:26] * apergos peeks in [13:57:34] <_joe_> should we switch to eqiad? [13:57:35] wut? [13:57:36] checking impact [13:57:47] we're already on eqiad [13:57:53] <_joe_> ok [13:57:58] clsuter reboot in progress, but it should not page [13:58:01] <_joe_> so no user-facing impact, correct? [13:58:03] ah, ok [13:58:03] <_joe_> ok [13:58:07] correct [13:58:07] good whew [13:58:12] * _joe_ closes incident [13:58:17] so lesser priority [13:58:20] *sigh* [13:58:23] yep [13:58:26] (03CR) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501320 (owner: 10Ladsgroup) [13:58:39] master re-election is taking longer than expected during master reboot [13:58:48] (03CR) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.22 [software/spicerack] - 10https://gerrit.wikimedia.org/r/501204 (owner: 10Volans) [13:59:35] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10Aklapper) Yann, this task is about detecting large regressions and outages. Pleas... 
[13:59:38] wow jenkins that took a while [14:00:11] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:00:13] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-psi-codfw: number_of_in_flight_fetch: 0, number_of_data_nodes: 14, cluster_name: production-search-psi-codfw, active_shards_percent_as_number: 85.70990493713585, active_shards: 2795, unassigned_shards: 466, relocating_shards: 0, task_max_waiting_in_queue_millis: 0, status: yellow, number_of_nodes: [14:00:13] lse, initializing_shards: 0, delayed_unassigned_shards: 466, active_primary_shards: 1087, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:00:43] 3 minutes for master re-election, that's strange. Looking into it [14:00:50] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 678 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:01:19] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:01:56] (03CR) 10Vgutierrez: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15583/ is happy, showing the expected NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/501319 (owner: 10Vgutierrez) [14:02:46] 10Operations, 10Packaging, 10Patch-For-Review: Add security apt security suites to pbuilder base images - https://phabricator.wikimedia.org/T220003 (10jbond) p:05Triage→03Normal [14:03:39] (03PS2) 10BBlack: Shortener VCL validation fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:03:49] 10Operations, 10Patch-For-Review: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (10jbond) 
05Open→03Resolved [14:04:43] elastic2048 was one of the master eligible nodes and is down with disk issues [14:04:53] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks really happy now: https://puppet-compiler.wmflabs.org/compiler1002/15585/" [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:04:58] we need to promote another node to master eligible [14:06:17] (03PS1) 10Gehel: elasticsearch: replace elastic2048 with 2049 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/501324 [14:06:22] (03CR) 10Effie Mouzeli: [C: 03+2] lvs: Use the kubernetes cluster for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [14:06:33] gehel: should we add a check to the cookbook that ensures the eligible masters are up before starting? [14:06:34] (03CR) 10Effie Mouzeli: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15584/lvs1016.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [14:07:27] volans: maybe, but more than that we need to ensure that we do promote a new master eligible when one is down [14:07:41] (03PS2) 10Effie Mouzeli: lvs: Use the kubernetes cluster for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [14:07:42] so more an icinga check I guess [14:07:50] yep, makes more sense [14:07:54] +1 [14:08:08] this isn't a problem just related to a rolling restart, we could lose one of those servers anytime [14:08:53] (03CR) 10Mathew.onipe: [C: 03+1] elasticsearch: replace elastic2048 with 2049 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/501324 (owner: 10Gehel) [14:09:09] (03CR) 10Gehel: [C: 03+2] elasticsearch: replace elastic2048 with 2049 as master eligible [puppet] - 10https://gerrit.wikimedia.org/r/501324 (owner: 
10Gehel) [14:09:17] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 38 probes of 442 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:10:51] uhm interesting [14:11:25] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [14:11:29] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [14:11:45] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [14:11:47] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [14:12:05] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [14:12:15] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [14:12:15] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [14:12:25] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [14:12:27] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [14:12:45] (03PS3) 10BBlack: Shortener VCL validation fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:14:33] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 442 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [14:15:29] XioNoX: what do you usually look at for such events from RIPE Atlas? [14:16:02] you mean apart from what's in the wiki linked? :-P [14:17:24] yes [14:17:29] I looked there [14:18:04] wait, are you telling me that the documentation doesn't cover 100% of possible cases? 
:-P [14:18:21] I know, it is shocking [14:20:07] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.391e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:20:30] buuuuu [14:20:43] I bet that this is related to the rolling restart [14:20:58] yep CirrusSearch [14:21:06] I guess the usual issue when the cluster is restarted [14:21:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] pbuilder: add security updates repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond) [14:23:02] !log Depooling scb* from service cxserver traffic [14:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:16] !log jiji@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=cxserver,cluster=scb,name=scb.* [14:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:33] !log jiji@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,service=cxserver,cluster=scb,name=scb.* [14:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:57] (03PS9) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [14:27:09] RIPE Atlas alerts aren't necessarily actionable, but they can be indicators of broader routing issues between us and the rest of the Internet if they go over the heuristic thresholds of expected failures.... 
It should probably at least have a line of documentation or something, but can wait for X to do that properly [14:27:12] (03CR) 10jerkins-bot: [V: 04-1] cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [14:29:10] (03PS10) 10Mathew.onipe: cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) [14:30:35] (03PS4) 10Jbond: pbuilder: add security updates repository [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) [14:34:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond) [14:39:07] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) @Gehel the IDRAC is not showing any failed disk. Can you from the OS pull up the log showing the failed disk. Thanks. [14:39:23] (03CR) 10Jbond: pbuilder: add security updates repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501163 (https://phabricator.wikimedia.org/T220003) (owner: 10Jbond) [14:41:02] (03CR) 10Effie Mouzeli: [C: 04-1] "For some reason, scb* cluster was pooled again, until we figure out why, we should not merge this change" [puppet] - 10https://gerrit.wikimedia.org/r/496382 (https://phabricator.wikimedia.org/T213195) (owner: 10Alexandros Kosiaris) [14:41:05] (03PS1) 10Ema: hieradata/labs: add wikibase monitoring flag [puppet] - 10https://gerrit.wikimedia.org/r/501331 (https://phabricator.wikimedia.org/T213705) [14:42:11] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10bd808) >>! In T218024#5084883, @aborrero wrote: >>>! 
In T218024#5049770, @bd808 wrote: >> I seem to have not accounted f... [14:46:20] (03PS2) 10Ema: hieradata/labs: add wikibase monitoring flag [puppet] - 10https://gerrit.wikimedia.org/r/501331 (https://phabricator.wikimedia.org/T213705) [14:46:28] (03CR) 10Volans: "Some more findings inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [14:47:04] (03CR) 10Reedy: wikitech: Disable Phabricator accounts when blocked on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [14:48:06] (03PS4) 10BBlack: Shortener VCL validation fixups [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:48:29] (03CR) 10Ladsgroup: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:48:31] (03CR) 10Ema: [C: 04-1] Shortener VCL validation fixups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501032 (https://phabricator.wikimedia.org/T219986) (owner: 10Ladsgroup) [14:49:58] (03CR) 10Vgutierrez: [C: 03+1] hieradata/labs: add wikibase monitoring flag [puppet] - 10https://gerrit.wikimedia.org/r/501331 (https://phabricator.wikimedia.org/T213705) (owner: 10Ema) [14:50:56] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Gehel) ` ehel@elastic2048:~$ cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sda1[0](F) sdb1[1] 29279232 blocks super 1.2 [2/1... [14:53:56] cdanis: did you look at https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts ? 
(the link tied to the alert is a bit wrong) [14:54:36] (03CR) 10Ema: [C: 03+2] hieradata/labs: add wikibase monitoring flag [puppet] - 10https://gerrit.wikimedia.org/r/501331 (https://phabricator.wikimedia.org/T213705) (owner: 10Ema) [14:55:49] ahh I missed that one [14:55:53] (03PS5) 10Andrew Bogott: puppet-compiler: restore the ability to export facts without puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430) [14:57:21] (03CR) 10Andrew Bogott: [C: 03+2] puppet-compiler: restore the ability to export facts without puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [14:57:40] (03PS1) 10CDanis: fix notes_url for RIPE Atlas ping alerts [puppet] - 10https://gerrit.wikimedia.org/r/501334 [14:58:00] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Gehel) From syslog: ` Apr 4 06:25:04 elastic2048 kernel: [10370894.621036] sd 2:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK Apr 4 06:25:04 elastic2048 kernel: [10370894.... [14:58:55] (03PS12) 10CRusnov: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [14:59:25] !log installing libdatetime-timezone-perl updates [14:59:29] (03CR) 10Bstorm: "Thanks for the patch, Meshvogel! 
It seems there's a bit of history to pick through here, and I agree with Jcrespo about it potentially ne" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:56] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: Migrate all metrics originated by PoPs from statsd to Prometheus - https://phabricator.wikimedia.org/T220116 (10fgiunchedi) [15:00:36] (03CR) 10BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [15:03:47] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=0) [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:21] (03PS3) 10Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430) [15:12:54] cdanis: let me know if it makes sense and if you see anything I could improve. 
But indeed, those alerts have never really been actionable [15:13:10] I'm fixing the notes_url so the alert points to the right place [15:13:24] aside from that I'm not yet sure what to do, but I'm thinking about it [15:13:34] (03PS7) 10Andrew Bogott: support multiple facts dirs [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) [15:13:45] (03PS2) 10CDanis: fix notes_url for RIPE Atlas ping alerts [puppet] - 10https://gerrit.wikimedia.org/r/501334 [15:14:33] (03CR) 10CDanis: [C: 03+2] fix notes_url for RIPE Atlas ping alerts [puppet] - 10https://gerrit.wikimedia.org/r/501334 (owner: 10CDanis) [15:14:58] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) [15:15:00] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: Migrate all metrics originated by PoPs from statsd to Prometheus - https://phabricator.wikimedia.org/T220116 (10fgiunchedi) [15:15:02] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [15:15:16] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10RolandUnger) wikivoyage-old.org was used only at the transfer process of the Wikivoyage dom... 
[15:16:21] (03PS1) 10Hoo man: WikibaseClient: Conditionally enabled mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) [15:16:47] (03PS4) 10Dzahn: confd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) [15:17:00] (03PS4) 10Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430) [15:18:01] (03PS2) 10Hoo man: WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) [15:18:56] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10greg) [15:19:19] (03CR) 10jerkins-bot: [V: 04-1] Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [15:21:35] thanks! [15:21:53] one option would be to raise the alert threshold much more [15:22:16] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [15:24:36] (03CR) 10Reedy: wikitech: Disable Phabricator accounts when blocked on wikitech (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [15:24:38] (03CR) 10Jcrespo: [C: 04-1] "This patch seems wrong, it raises an exception if running as root." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [15:25:13] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for mongodb [puppet] - 10https://gerrit.wikimedia.org/r/501336 (https://phabricator.wikimedia.org/T135991) [15:28:43] (03PS1) 10Dzahn: druid: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/501337 [15:29:06] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) Thanks for the confirmation @RolandUnger . Appreciate it and will remove it. [15:29:26] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) [15:31:27] (03PS3) 10TheAnarcat: allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 [15:32:24] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) @RolandUnger @DerFussi Just to make sure, this also includes the email aliases we on... [15:33:23] (03CR) 10TheAnarcat: "indeed, fixed that." [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [15:38:21] (03CR) 10Jcrespo: "That seems more reasonable-- but obviously @Volans is the person to ask if that approach is ok." 
[software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [15:38:51] !log rolling restart of proton to pick up openssl security update [15:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:16] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10RobH) Also: ` robh@elastic2048:~$ sudo mdadm --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Tue Dec 4 16:44:41 2018 Raid Level : raid1 Array Size : 29279232 (27.92 GiB 29.98... [15:41:13] (03PS1) 10Dzahn: exim: remove wikivoyage-old.org as a wikimedia email domain [puppet] - 10https://gerrit.wikimedia.org/r/501340 (https://phabricator.wikimedia.org/T219867) [15:41:56] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10RobH) Please note this system is out of warranty and any disk swaps will need to be accomplished with on site spares. [15:41:57] (03CR) 10Dzahn: [C: 03+2] exim: remove wikivoyage-old.org as a wikimedia email domain [puppet] - 10https://gerrit.wikimedia.org/r/501340 (https://phabricator.wikimedia.org/T219867) (owner: 10Dzahn) [15:42:13] !log depooling kafka2001 for eventbus security updates [15:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) Create Dispatch: Success You have successfully submitted request SR988826339. Your dispatch request has been successfully created and will be reviewed by our team. You can monitor its progress on yo... [15:43:05] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10RolandUnger) Yes, this should also include email aliases. The wikivoyage-ev.org addresses a... 
[15:45:52] 10Operations, 10Domains, 10Traffic, 10serviceops, 10Patch-For-Review: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) Ack, removing wikivoyage-old.org for email and web, not touching wikivoyage.org in a... [15:48:38] XioNoX: I was thinking about either raising the threshold, or requiring the time window to be longer [15:49:16] hrmm.. puppet alert on mx1001 will be me [15:49:32] removed wikivoyage-old.org domain from exim aliases [15:49:49] fix on the way [15:50:05] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RobH) p:05Triage→03Normal [15:50:10] (03CR) 10Nuria: [C: 03+1] admin: add gpu-users group and assign it to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/501156 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [15:50:12] 6 # mail aliases (mchenry) [15:50:15] lol @ mchenry [15:51:16] (03CR) 10Jcrespo: allow running cumin as a regular user (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (owner: 10TheAnarcat) [15:51:41] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RobH) The new policies for shell access have all group additions approved by their service owners. It is my understanding that @nuria will be approving for any analytics grou... 
[15:51:53] 10Operations, 10SRE-Access-Requests: analytics-wmde group addition for Lucas Werkmeister - https://phabricator.wikimedia.org/T220084 (10RobH) a:03Nuria [15:52:16] (03PS2) 10Dzahn: park wikivoyage-old.org [dns] - 10https://gerrit.wikimedia.org/r/500978 (https://phabricator.wikimedia.org/T219867) [15:52:23] !log pooling kafka2001 eventbus [15:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:52] !log depooling kafka2002 for eventbus security updates [15:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:21] (03PS1) 10Joal: Update AQS druid datasource to 2019-03 snapshot [puppet] - 10https://gerrit.wikimedia.org/r/501341 [15:55:01] (03CR) 10Dzahn: [C: 03+2] "it has been confirmed by Roland Unger of Wikivoyage e.V. that this can be removed entirely" [dns] - 10https://gerrit.wikimedia.org/r/500978 (https://phabricator.wikimedia.org/T219867) (owner: 10Dzahn) [15:55:07] !log repooling kafka2002 eventbus [15:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:37] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10MoritzMuehlenhoff) Is anyone still using Servermon at this point? [15:56:08] !log depooling kafka2003 for eventbus security updates [15:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:43] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10akosiaris) I can say I haven't in a pretty long time. If @faidon also doesn't I think we can shut it down. [15:59:14] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:59:37] !log wikivoyage-old.org domain has been retired and deactivated (T219867, T81727) [15:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:43] T219867: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 [15:59:43] T81727: DNS for wikivoyage-old.org - https://phabricator.wikimedia.org/T81727 [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 7967 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [16:00:54] (03CR) 10Krinkle: [C: 03+1] "Nice :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [16:01:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501336 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:01:25] !log repooling kafka2003 eventbus [16:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:20] (03PS13) 10CRusnov: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) [16:04:28] (03CR) 10Vgutierrez: [C: 03+2] cloudelastic: use acme_chief to get ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:04:37] (03PS11) 10Vgutierrez: cloudelastic: use acme_chief to get ssl 
cert [puppet] - 10https://gerrit.wikimedia.org/r/501158 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:05:43] (03PS1) 10Filippo Giunchedi: sessionstore: add and use 'sessionstore' cluster [puppet] - 10https://gerrit.wikimedia.org/r/501345 (https://phabricator.wikimedia.org/T219523) [16:06:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] LLDP fact - return correct port information [puppet] - 10https://gerrit.wikimedia.org/r/500795 (owner: 10Ayounsi) [16:06:22] (03PS2) 10Alexandros Kosiaris: LLDP fact - return correct port information [puppet] - 10https://gerrit.wikimedia.org/r/500795 (owner: 10Ayounsi) [16:08:22] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501345 (https://phabricator.wikimedia.org/T219523) (owner: 10Filippo Giunchedi) [16:08:39] (03CR) 10Filippo Giunchedi: [C: 03+2] sessionstore: add and use 'sessionstore' cluster [puppet] - 10https://gerrit.wikimedia.org/r/501345 (https://phabricator.wikimedia.org/T219523) (owner: 10Filippo Giunchedi) [16:08:42] (03PS2) 10Filippo Giunchedi: sessionstore: add and use 'sessionstore' cluster [puppet] - 10https://gerrit.wikimedia.org/r/501345 (https://phabricator.wikimedia.org/T219523) [16:11:09] akosiaris: ok to merge your change too? [16:11:56] υθπ [16:11:58] yup [16:11:59] thanks! [16:13:28] (03PS5) 10BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) [16:13:36] akosiaris: np! [16:14:41] (03CR) 10BryanDavis: "The puppet side of this is merged and working, so in theory this can go live as soon as folks are happy with the implementation." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [16:17:21] (03PS1) 10Alex Monk: Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - 10https://gerrit.wikimedia.org/r/501346 [16:17:42] (03PS1) 10Vgutierrez: Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - 10https://gerrit.wikimedia.org/r/501347 [16:18:32] (03Abandoned) 10Vgutierrez: Revert "profile::cache::ssl::wikibase: Simplify" [puppet] - 10https://gerrit.wikimedia.org/r/501347 (owner: 10Vgutierrez) [16:18:48] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Session storage Cassandra metrics (Prometheus) not being collected - https://phabricator.wikimedia.org/T219523 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We're on: (more targets will appear as puppet runs) ` root@prometheus1003:/srv/prometheus/... [16:18:50] 10Operations, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 4 others: Session storage Cassandra cluster configuration - https://phabricator.wikimedia.org/T215883 (10fgiunchedi) [16:21:43] somebody is running a puppet compiler facts update as we speak? [16:26:20] PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:26:26] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [16:27:14] PROBLEM - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 5000: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:27:14] PROBLEM - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: connect to address 208.80.153.59 and port 35357: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:31:43] !log beginning rolling kafka restarts on kafka200[123] for security updates [16:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:11] (03CR) 10BryanDavis: "This can go live as soon as Wikitech has 1.33.0-wmf.24 deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [16:34:57] (03PS3) 10BryanDavis: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) [16:36:41] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/15592/" [puppet] - 10https://gerrit.wikimedia.org/r/501346 (owner: 10Alex Monk) [16:37:06] (03Abandoned) 10BryanDavis: openldap: Set default password policy [puppet] - 10https://gerrit.wikimedia.org/r/497684 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [16:46:04] (03PS4) 10Jcrespo: Allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (https://phabricator.wikimedia.org/T218440) (owner: 10TheAnarcat) [16:50:09] 10Operations, 10Horizon: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10Pchelolo) [16:52:44] (03CR) 10BryanDavis: "> How was this working in the main deployment?" 
[puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [16:53:55] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Volans) And when we do, can we also drop the `package_updates` custom fact? [16:54:41] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) With the changes in packages now trying to run any model... [16:54:55] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [16:55:12] 10Operations, 10Horizon: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10Krenair) It's the same 2FA for wikitech, perhaps you have your recovery tokens? 
[16:55:26] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to cloudweb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [16:55:46] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:55:48] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [16:57:47] (03PS1) 10Vgutierrez: ssl::wikibase: Fix le_subjects hieradata key name [puppet] - 10https://gerrit.wikimedia.org/r/501357 [16:58:12] (03CR) 10jerkins-bot: [V: 04-1] Allow running cumin as a regular user [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (https://phabricator.wikimedia.org/T218440) (owner: 10TheAnarcat) [16:58:30] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) A solution could be to remove `mivisionx` (not sure if needed)... [16:59:00] 10Operations, 10Horizon: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10Pchelolo) Unfortunately, no. [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1700). Please do the needful. [17:01:42] (03CR) 10Ema: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/501346 (owner: 10Alex Monk) [17:01:48] (03CR) 10Ema: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/501357 (owner: 10Vgutierrez) [17:02:53] (03CR) 10Urbanecm: [C: 03+1] Publish throttle-analyze at noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481267 (https://phabricator.wikimedia.org/T187894) (owner: 10Framawiki) [17:02:55] 10Operations, 10Horizon: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10Krenair) As you're a deployer I imagine you could follow https://wikitech.wikimedia.org/wiki/Password_reset#Wikimedia_or_wikitech_two_factor_authentication_removal and reset it yourself from the serv... [17:03:52] 10Operations, 10Analytics, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10fdans) p:05Triage→03Normal [17:03:57] (03CR) 10Jcrespo: "Aside from the comment, also unit tests would need update. Although I would wait for informed feedback first." [software/cumin] - 10https://gerrit.wikimedia.org/r/497312 (https://phabricator.wikimedia.org/T218440) (owner: 10TheAnarcat) [17:06:54] (03PS1) 10Ema: WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 [17:07:06] 10Operations, 10wikitech.wikimedia.org: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10JJMC89) [17:07:24] 10Operations, 10wikitech.wikimedia.org: Reset 2FA for Horizon for user ppchelko - https://phabricator.wikimedia.org/T220128 (10Pchelolo) 05Open→03Resolved a:03Pchelolo Oh, awesome. Thank you! 
[17:07:51] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10fdans) p:05Triage→03Normal [17:07:53] (03PS1) 10CDanis: monitoring hostgroups: rename 'sessions' to 'sessionstore' [puppet] - 10https://gerrit.wikimedia.org/r/501361 [17:08:30] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (10fdans) p:05Triage→03Normal [17:08:33] (03CR) 10jerkins-bot: [V: 04-1] WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (owner: 10Ema) [17:09:05] (03PS2) 10CDanis: monitoring hostgroups: rename 'sessions' to 'sessionstore' [puppet] - 10https://gerrit.wikimedia.org/r/501361 (https://phabricator.wikimedia.org/T219523) [17:09:55] (03CR) 10CDanis: [C: 03+2] monitoring hostgroups: rename 'sessions' to 'sessionstore' [puppet] - 10https://gerrit.wikimedia.org/r/501361 (https://phabricator.wikimedia.org/T219523) (owner: 10CDanis) [17:10:40] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.rolling-reboot [17:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:00] (03PS3) 10Kosta Harlan: Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) [17:13:36] (03PS2) 10Ema: WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 [17:14:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (owner: 10Ema) [17:15:56] (03PS3) 10Ema: WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 [17:16:05] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct 
https://wikitech.wikimedia.org/wiki/Icinga [17:16:07] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10akosiaris) >>! In T198939#5085736, @Volans wrote: > And when we do, can we also drop the `package_updates` custom fact? Sure. [17:17:00] (03CR) 10jerkins-bot: [V: 04-1] WIP: role::cache::upload: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (owner: 10Ema) [17:17:36] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work), and 2 others: Create checks that alerts on cirrussearch update lags - https://phabricator.wikimedia.org/T219601 (10EBernhardson) [17:17:44] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work), and 2 others: Create checks that alerts on cirrussearch update lags - https://phabricator.wikimedia.org/T219601 (10EBernhardson) p:05Triage→03Normal [17:26:23] (03PS2) 10Bstorm: osmdb: set the CNAME for osmdb to the new instance in Cloud VPS [dns] - 10https://gerrit.wikimedia.org/r/500086 (https://phabricator.wikimedia.org/T219652) [17:26:32] (03CR) 10jerkins-bot: [V: 04-1] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS [dns] - 10https://gerrit.wikimedia.org/r/500086 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [17:28:24] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.rolling-reboot (exit_code=97) [17:28:24] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:12] !log killing ongoing backup at dbprov2002, stuck [17:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:18] (03PS3) 10Bstorm: osmdb: set the CNAME for osmdb to the new instance in Cloud VPS [dns] - 10https://gerrit.wikimedia.org/r/500086 (https://phabricator.wikimedia.org/T219652) [17:29:23] (03PS1) 
10Urbanecm: Create uploader user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501363 (https://phabricator.wikimedia.org/T216615) [17:30:57] (03CR) 10Bstorm: [C: 03+2] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS [dns] - 10https://gerrit.wikimedia.org/r/500086 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [17:33:07] !log stopping replication on dbstore2001:s8 for backup testing T206203 [17:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:11] T206203: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 [17:40:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10Andrew) [17:40:53] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:41:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [17:41:38] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10Andrew) [17:41:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:43:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) Checklist for moving a cloudvirt from 1G to 10G: [] - put system offline in all checks for maint window [] - relocate to 10G rack and up... 
[17:43:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1018 to a 10G rack, connect 10G nics - https://phabricator.wikimedia.org/T217347 (10RobH) [17:43:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate and reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10RobH) [17:44:04] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10Andrew) [17:44:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:45:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10RobH) [17:45:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1018 with 10G interfaces - https://phabricator.wikimedia.org/T217347 (10RobH) [17:45:47] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@f69dc9c]: Switch to new logging infrastructure T211125 [17:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:53] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [17:46:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10RobH) [17:47:04] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) per https://github.com/RadeonOpenCompute/ROCm/issues/703... 
[17:47:31] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@f69dc9c]: Switch to new logging infrastructure T211125 (duration: 01m 44s) [17:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:20] https://phabricator.wikimedia.org/T218089#5085939 [17:51:36] (03CR) 10Volans: [C: 03+2] "Nice!" [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [17:51:56] once is a minor issue, but if recurring to several files, it is a problem [17:55:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216324 (10RobH) [17:55:21] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@922cbc0]: Switch to new logging infrastructure T211125 [17:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:25] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [17:57:31] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10Andrew) [17:58:24] (03PS1) 10Andrew Bogott: cloudvirt1008: disable notifications during rebuild [puppet] - 10https://gerrit.wikimedia.org/r/501368 [17:59:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10RobH) So when attempting to setup PXE on this, the network device 2 (ie the 10G interface) isn't showing as a bo... 
[17:59:24] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@922cbc0]: Switch to new logging infrastructure T211125 (duration: 04m 03s) [17:59:25] !log stopped postgresql on labsdb1006.eqiad.wmnet and moved the database master functionality (and all rsyncs) to clouddb1003.clouddb-services.eqiad.wmflabs [17:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:02] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) ChangeProp and JobQueue ChangeProp has been moved to the new logging infra as well. [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1800). [18:00:04] dmaza and kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:15] (03Merged) 10jenkins-bot: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [18:00:33] (03PS3) 10CRusnov: puppetdb_microservice: Redo how it returns values [puppet] - 10https://gerrit.wikimedia.org/r/501104 [18:00:34] I can SWAT [18:01:12] here [18:01:17] (starting in 5 mins or so) [18:01:25] (03CR) 10Catrope: [C: 03+2] Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: 10Kosta Harlan) [18:02:40] (03Merged) 10jenkins-bot: Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: 10Kosta Harlan) [18:06:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:06:41] (03CR) 10Mobrovac: "kk great! Alex, Filippo, let's move on this next week?" 
[puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [18:06:55] (03PS1) 10Muehlenhoff: snapshot: Add hhvm to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/501370 [18:07:03] here [18:08:37] (03CR) 10jenkins-bot: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [18:12:34] (03CR) 10Alaa Sarhan: [C: 03+1] WikibaseClient: Conditionally enable mapframe support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501335 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [18:12:42] kostajh: OK your patch is live on mwdebug1002, please test [18:12:46] :P [18:12:55] I'm pinged instead of Alaa Sarhan :) [18:13:25] !log restarted apache on people.wikimedia.org to pick up OpenSSL update [18:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:30] RoanKattouw: looking [18:14:10] (03PS1) 10Bstorm: postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) [18:15:18] RoanKattouw: none of the ORES filters return any edits [18:15:24] Yeah I saw that too [18:15:27] https://es.wikiquote.org/wiki/Especial:ORESModels looks correct though [18:15:33] (03PS2) 10Bstorm: postgresql: set recovery.conf to writeable by postgres user [puppet] - 10https://gerrit.wikimedia.org/r/501371 (https://phabricator.wikimedia.org/T219652) [18:15:40] I'll try running the population script again, it was giving me strange errors [18:15:46] I wonder if you need to run the script to load historic scores. [18:15:53] We might have to try deploying this then running the population script [18:15:59] Gotcha. Let me know what those errors are and I'll try to help. [18:16:23] Oh here we go, now it's working [18:16:48] oh yeah. 
I see that too [18:17:57] (03CR) 10jenkins-bot: Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: 10Kosta Harlan) [18:19:08] (03CR) 10TheAnarcat: "oh wow this was merged already! was the documentation updated to give examples of how that works?" [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [18:20:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:21:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [18:21:36] (03CR) 10Volans: "> Patch Set 13:" [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [18:22:01] OK this looks good enough that I'm going to deploy it now. 
The population script finished without errors [18:22:31] RoanKattouw: thanks, sounds good to me [18:23:22] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ORES RCFilters on eswikiquote (T219160) (duration: 01m 02s) [18:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:25] T219160: Enable ORES filters for Spanish Wikiquote - https://phabricator.wikimedia.org/T219160 [18:23:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 (10RobH) [18:23:42] \o/ [18:24:49] Alright, next up is dmaza [18:24:54] :) [18:25:14] (03CR) 10Catrope: [C: 03+2] Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:25:46] (03CR) 10Volans: [C: 04-1] "There are still a couple of things not completely fixed since last PS comments and I've added a couple of questions/comments too."
(037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [18:27:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10RobH) [18:28:34] (03PS2) 10Catrope: Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:28:41] (03CR) 10Catrope: Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:28:47] (03CR) 10Catrope: [C: 03+2] Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:28:49] 10Operations, 10ops-eqiad, 10DC-Ops: cloudvirt1015: update raid config and move to 10Gb - https://phabricator.wikimedia.org/T217140 (10RobH) a:03Cmjohnson [18:29:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Move cloudvirt1012 to a 10G rack and connect 10g nics - https://phabricator.wikimedia.org/T217346 (10RobH) a:03Cmjohnson [18:29:51] (03Merged) 10jenkins-bot: Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:30:07] (03CR) 10jenkins-bot: Enable Partial Blocks on French and Polish wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501044 (https://phabricator.wikimedia.org/T219327) (owner: 10Dmaza) [18:30:45] dmaza: Live on mwdebug1002, please test [18:30:50] checking [18:31:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1009 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10RobH) a:03Cmjohnson [18:31:42] 
10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10RobH) [18:32:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10RobH) [18:32:04] RoanKattouw: looks good here [18:32:49] 10Operations, 10ops-eqiad, 10DC-Ops: relocate/reimage cloudvirt1015 with 10G interfaces - https://phabricator.wikimedia.org/T217140 (10RobH) [18:32:58] (03CR) 10Volans: [C: 03+1] "LGTM, its deployment need to be coordinated with the related changes in the Netbox report too." [puppet] - 10https://gerrit.wikimedia.org/r/501104 (owner: 10CRusnov) [18:33:26] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable partial blocks on frwiki, plwiki (T219327, T219218) (duration: 00m 58s) [18:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:31] T219327: Deploy partial blocks on Polish wikipedia - https://phabricator.wikimedia.org/T219327 [18:33:32] T219218: Deploy partial blocks to French Wikipedia - https://phabricator.wikimedia.org/T219218 [18:33:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10RobH) [18:34:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10RobH) [18:34:39] OK, SWAT done [18:34:47] thank you [18:35:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10RobH) [18:45:23] (03PS1) 10Bstorm: postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 
(https://phabricator.wikimedia.org/T219652) [18:47:45] (03PS8) 10CRusnov: Break report into parts and adjust the way devices are filtered [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 [18:48:00] (03CR) 10CRusnov: Break report into parts and adjust the way devices are filtered (036 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/499245 (owner: 10CRusnov) [18:54:42] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Bstorm) p:05Triage→03Normal [18:55:13] 10Operations, 10Data-Services, 10decommission, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Bstorm) [18:55:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1012 with 10G interfaces - https://phabricator.wikimedia.org/T217346 (10RobH) Ok, I've attempted to update the firmware of the ilom multiple times, all to no avail. The method I've attempted to use for cloudvirt... [19:00:04] marxarelli: Dear deployers, time to do the MediaWiki train - Americas version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T1900). [19:03:30] !log preparing to promote 1.33.0-wmf.24 to group1 [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:34] !log fetch/rebase looks good, incorporates fixes for T220037, T219510. 
deploying [19:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [19:06:39] T219510: Citoid should only usurp " (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501388 [19:07:35] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501388 (owner: 10Dduvall) [19:08:40] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501388 (owner: 10Dduvall) [19:10:52] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:06] damn. seeing the same spike in DBTransactionError as yesterday... [19:12:39] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.24 (duration: 01m 46s) [19:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:13] !log large spike in DBTransactionError errors. rolling back. cc: T220037 [19:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:19] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [19:14:20] hmm... actually it seems a lesser spike than yesterday's. holding for now [19:14:54] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501388 (owner: 10Dduvall) [19:15:37] (03CR) 10Krinkle: "Indeed. These are left intentionally, similar ones exist for read-only mode elsewhere as well." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/497477 (owner: 10Reedy) [19:15:56] (03CR) 10Krinkle: [C: 03+1] Remove dupe DB comments (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497477 (owner: 10Reedy) [19:16:10] spike isn't subsiding. plateauing rather. k, rolling back [19:16:25] (03CR) 10Reedy: "Indeed, and they were happy for them oto be removed (hence the patch)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497477 (owner: 10Reedy) [19:19:08] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.33.0-wmf.24" [19:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:39] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:21:01] (03PS1) 10Dduvall: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501392 [19:22:39] (03PS1) 10Jon Harald Søby: Add smn and sms to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501393 (https://phabricator.wikimedia.org/T220118) [19:24:20] (03CR) 10Dduvall: [C: 03+2] Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501392 (owner: 10Dduvall) [19:24:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:25:47] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501392 (owner: 10Dduvall) [19:26:15] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.24" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501392 (owner: 10Dduvall) [19:28:53] PROBLEM - Check systemd state on 
scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:32:23] PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:33:31] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 77271 bytes in 1.781 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:43:25] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthentication.php: (no justification provided) (duration: 00m 59s) [19:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:24] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthenticationHooks.php: (no justification provided) (duration: 00m 59s) [19:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:23] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthenticationPlugin.php: (no justification provided) (duration: 00m 58s) [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:22] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/EventBus/includes/JobExecutor.php: (no justification provided) (duration: 00m 58s) [19:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:22] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/Citoid/modules/ve.ui.Citoid.init.js: (no justification provided) (duration: 00m 59s) [19:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:50] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501396 [19:48:52] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501396 (owner: 10Dduvall) [19:50:00] (03Merged) 
10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501396 (owner: 10Dduvall) [19:51:58] !log re-deploying to group1 after proper syncs [19:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:12] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24 [19:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:00] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.24 (duration: 01m 47s) [19:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:17] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Papaul) Dear Papaul, Your dispatch shipped on 4/4/2019 3:26 PM [19:55:47] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [19:58:11] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/EventBus/includes/JobExecutor.php: syncing JobExecutor changes (duration: 00m 58s) [19:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:54] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501396 (owner: 10Dduvall) [20:02:52] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthentication.php: sync for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LdapAuthentication/+/500994 (duration: 00m 58s) [20:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:15] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:03:51] !log dduvall@deploy1001 Synchronized 
php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthenticationHooks.php: sync for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LdapAuthentication/+/500994 (duration: 00m 58s) [20:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:49] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/LdapAuthentication/LdapAuthenticationPlugin.php: sync for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LdapAuthentication/+/500994 (duration: 00m 57s) [20:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:46] (03CR) 10Kaldari: [C: 04-2] "We're freezing new deployments of Flow for the time-being, pending the outcome of the current Talk page consultations (per Danny)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497243 (https://phabricator.wikimedia.org/T119365) (owner: 10Gergő Tisza) [20:06:33] !log dduvall@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/Citoid/modules/ve.ui.Citoid.init.js: sync for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Citoid/+/501114 (duration: 00m 58s) [20:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:46] (03PS2) 10Bstorm: labstore: cleanup the remaining files after Icc89332f0e779 [puppet] - 10https://gerrit.wikimedia.org/r/501070 (https://phabricator.wikimedia.org/T209527) [20:09:35] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo) Apparently `graphoid` is still using service::node::conf... 
[20:11:08] (03PS1) 10Cwhite: grafana: update swift dashboard to use new metric names [puppet] - 10https://gerrit.wikimedia.org/r/501399 (https://phabricator.wikimedia.org/T219825) [20:11:12] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) [20:11:35] !log error rates look good after proper syncs and re-deploy. cc: T220037 [20:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:39] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [20:11:55] !log promoting 1.33.0-wmf.24 to all wikis [20:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:02] (03PS1) 10Dduvall: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501401 [20:13:04] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501401 (owner: 10Dduvall) [20:14:10] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501401 (owner: 10Dduvall) [20:16:07] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.24 [20:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:55] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501401 (owner: 10Dduvall) [20:24:51] (03PS6) 10BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) [20:25:24] Pchelolo: let's move discussion here if you don't mind [20:25:43] (03CR) 10BryanDavis: "> Uploaded patch set 6." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [20:25:55] sure. AaronSchulz might be a better person to ask at this point - I've just ported his patch [20:25:57] the error rate looks much lower than before (100-300 errors per minute) [20:26:01] but that's still pretty high [20:26:33] i'll post what i'm seeing to https://phabricator.wikimedia.org/T220037 [20:26:36] AaronSchulz: ^ [20:30:27] trying to decide whether this merits another rollback... [20:30:40] thcipriani, could use your opinion ^ [20:30:55] * thcipriani looks [20:31:54] seeing ~ 800 of that transaction error per minute, 1/10th of what i saw before but it's still rather alarming [20:32:22] ~ 800 per minute that is, on average [20:32:35] 12,200 in the past 15 min [20:34:12] (03CR) 10Herron: [C: 03+1] "Looks good overall to me, one relatively minor thing inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [20:34:36] (03CR) 10Krinkle: [C: 03+1] wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [20:35:13] marxarelli: hrm, yeah, that is a rather high error rate and doesn't seem to be subsiding in the short time I've been watching it. Probably warrants a rollback. [20:35:46] alrighty [20:36:18] (03CR) 10Herron: [C: 03+1] "I'm up for giving this a try. Thanks for putting this together!" 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501039 (https://phabricator.wikimedia.org/T219430) (owner: 10Andrew Bogott) [20:36:36] (03CR) 10Herron: [C: 03+1] flake8 fixes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/501130 (owner: 10Andrew Bogott) [20:36:51] !log rolling back again following still high rates of DBTransactionError (avg ~ 800/min) [20:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:39] (03CR) 10Bstorm: "Confirmed to be NOOP on all labstore servers. Merging" [puppet] - 10https://gerrit.wikimedia.org/r/501070 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [20:38:44] (03CR) 10Bstorm: [C: 03+2] labstore: cleanup the remaining files after Icc89332f0e779 [puppet] - 10https://gerrit.wikimedia.org/r/501070 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [20:41:21] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Revert "group2/group1 wikis to 1.33.0-wmf.24" [20:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:00] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) Hi All, This task is done. Thank you for patience throughout. The last Groups Migration... [20:45:10] !log promotion of 1.33.0-wmf.24 rolled back to group0 and holding. 
cc: T206678, T220037 [20:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:16] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [20:45:19] T220037: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 [20:46:53] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:48:23] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10Krinkle) [20:50:03] jijiki: Thanks! [20:50:31] 10Operations, 10Continuous-Integration-Config: operations/puppet CI errors not being displayed in console log - https://phabricator.wikimedia.org/T214726 (10hashar) [20:50:52] James_F: :) [20:55:45] greg-g: fyi, train is holding at group0 [20:55:55] :( [20:58:41] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:03:31] (03PS3) 10BBlack: Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) [21:09:03] !log restarting eqiad ELK stack for security updates [21:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:12] (03CR) 10BBlack: [C: 03+2] Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [21:17:36] !log DNS deploying https://gerrit.wikimedia.org/r/c/operations/dns/+/500731 which can affect resolution of our CNAME records. If dns-related issues, can revert at will! 
[21:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:28] jouncebot: refresh [21:19:29] I refreshed my knowledge about deployments. [21:22:00] !log renumber AS58587 to AS10075 in eqsin [21:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:49] PROBLEM - MariaDB disk space on dbstore1001 is CRITICAL: DISK CRITICAL - free space: /srv 670069 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:26:55] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [21:27:51] (03CR) 10Krinkle: "here is output from using 'fab deploy_docker' which runs docker-pkg on contint1001 after pulling down a commit that updates 3 images that " [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [21:32:43] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review, 10User-zeljkofilipin: npm 6 consistently fails with "Z_DATA_ERROR: invalid distance too far back" on some repos - https://phabricator.wikimedia.org/T215562 (10Krinkle) 05Open→03Resolved [21:35:53] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10Aklapper) @eross: If there is nothing left to do, feel free to resolve this task via the {nav n... 
[21:38:34] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10eross) 05Open→03Resolved [21:45:13] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10eliza) @bmansurov - just an FYI - I could be the one member for now - then when other folks begin to populate the group, you may remove me. This way - I can... [21:48:11] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10bmansurov) @eliza thanks, that'd be great. [21:50:05] 10Operations, 10netops, 10Patch-For-Review: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056 (10ayounsi) Note that we peer with RIPE RIS collectors in out POPs, so people can use https://stat.ripe.net/widget/looking-glass as a looking glass. [21:51:30] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10eliza) @bmansurov - can you please reply to that ticket and I will confirm on that thread when this group has been completed. [21:54:45] (03PS5) 10Andrew Bogott: compiler-update-facts: better support addition of arbitrary fact sets [puppet] - 10https://gerrit.wikimedia.org/r/500501 (https://phabricator.wikimedia.org/T219430) [21:55:41] PROBLEM - puppet last run on dns2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:57:23] (03PS8) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [21:57:27] (03PS3) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [21:57:29] (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501432 (https://phabricator.wikimedia.org/T218954) [21:58:42] (03Abandoned) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501432 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [21:59:05] (03PS3) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [21:59:22] (03PS9) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:01:07] (03PS10) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:01:09] (03PS4) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [22:01:13] (03PS4) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [22:05:11] (03PS11) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:05:13] (03PS5) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [22:05:17] (03PS5) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [22:06:15] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:07:18] (03PS2) 10Andrew Bogott: cloudvirt1008: disable notifications during rebuild [puppet] - 10https://gerrit.wikimedia.org/r/501368 [22:08:16] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1008: disable notifications during rebuild [puppet] - 10https://gerrit.wikimedia.org/r/501368 (owner: 10Andrew Bogott) [22:10:49] 10Operations, 10Office-IT, 10Wikimedia-Mailing-lists, 10CommRel-Specialists-Support (Jan-Mar-2019): Mailing list migration for Arbitration Committee to Google Group - https://phabricator.wikimedia.org/T215940 (10MarcoAurelio) [22:13:07] (03CR) 10Smalyshev: [C: 03+1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [22:14:44] (03CR) 10Smalyshev: [C: 03+1] Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [22:15:04] (03CR) 10Smalyshev: [C: 03+1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [22:18:29] (03PS1) 10Bstorm: cloudstore: add extension and get nfs-manage-binds passing linter [puppet] - 10https://gerrit.wikimedia.org/r/501434 
(https://phabricator.wikimedia.org/T209527) [22:28:39] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:34:24] (03PS12) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [22:34:26] (03PS6) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [22:34:28] (03PS6) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [22:39:40] (03CR) 10Smalyshev: [C: 03+1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [22:40:07] (03CR) 10Smalyshev: [C: 03+1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [22:40:22] (03PS1) 10Smalyshev: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) [22:40:42] (03CR) 10jerkins-bot: [V: 04-1] Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [22:41:18] (03CR) 10BryanDavis: "Related larger cleanup patch has been merged. The test this was cleaning up is removed now." 
[puppet] - 10https://gerrit.wikimedia.org/r/500409 (owner: 10Muehlenhoff) [22:41:34] (03CR) 10Bstorm: "Compiler looks good https://puppet-compiler.wmflabs.org/compiler1002/15598/labstore1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/501434 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:41:45] (03PS2) 10Bstorm: cloudstore: add extension and get nfs-manage-binds passing linter [puppet] - 10https://gerrit.wikimedia.org/r/501434 (https://phabricator.wikimedia.org/T209527) [22:44:46] (03PS1) 10Smalyshev: Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) [22:45:07] (03CR) 10jerkins-bot: [V: 04-1] Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [22:45:37] (03PS2) 10Smalyshev: Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 (https://phabricator.wikimedia.org/T218954) [22:46:59] (03PS2) 10Smalyshev: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) [22:47:21] (03CR) 10jerkins-bot: [V: 04-1] Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [22:49:36] (03CR) 10Bstorm: [C: 03+2] cloudstore: add extension and get nfs-manage-binds passing linter [puppet] - 10https://gerrit.wikimedia.org/r/501434 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:50:09] (03PS3) 10Smalyshev: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) [22:52:31] (03CR) 10Jforrester: [C: 03+1] Enable WBCS search on commons too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501436 
(https://phabricator.wikimedia.org/T218954) (owner: 10Smalyshev) [22:56:55] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190404T2300). [23:00:05] Jhs and bd808: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:31] present [23:01:07] (03CR) 10Jforrester: "I don't believe this change got sign-off from either me or Greg. :-( Please follow the process in future, it exists for a reason." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493409 (https://phabricator.wikimedia.org/T214905) (owner: 10WMDE-Fisch) [23:03:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:06:03] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) Synthetic benchmarks of runtime performance of CNN train... [23:07:44] no-one around for SWAT? 
:\ [23:13:11] * Jhs tries pinging MaxSem, twentyafterfour, dereckson and thcipriani one more time before going to bed [23:14:19] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [23:14:44] that's the most technically an alert i've ever seen [23:15:26] lol [23:17:06] * apergos peeks in [23:18:46] Jhs: nobody helping you with SWAT? [23:19:06] bd808, nope :\ [23:19:40] I've got a patch up too, so maybe I can do it. Let me grab a glass of water and then I'll look at your patch [23:19:50] coolio :) [23:20:42] (03PS1) 10Bstorm: cloudstore: A bit more cleanup [puppet] - 10https://gerrit.wikimedia.org/r/501446 (https://phabricator.wikimedia.org/T209527) [23:21:03] 10Operations, 10Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10Krenair) >>! In T219764#5084520, @fgiunchedi wrote: >>>! In T219764#5073355, @Krenair wrote: >> This was on deployment-sca02 but the list of deployment-prep instances failing puppet grew sud... [23:25:02] Jhs: this config change looks pretty straightforward. Is it something that you can test with x-wikimedia-debug on one of the test servers? [23:25:35] bd808, not sure actually. The only way I know to confirm it works is to save an edit on Wikidata, but will that work on mwdebug? [23:26:13] yeah, if you have the browser extension installed you should be able to do that [23:26:21] I do [23:26:53] cool.
Let me remember how to watch error logs and then we will be ready to give things a shot [23:30:37] (03PS2) 10BryanDavis: Add smn and sms to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501393 (https://phabricator.wikimedia.org/T220118) (owner: 10Jon Harald Søby) [23:30:42] (03CR) 10BryanDavis: [C: 03+2] Add smn and sms to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501393 (https://phabricator.wikimedia.org/T220118) (owner: 10Jon Harald Søby) [23:31:50] (03Merged) 10jenkins-bot: Add smn and sms to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501393 (https://phabricator.wikimedia.org/T220118) (owner: 10Jon Harald Søby) [23:32:04] (03CR) 10jenkins-bot: Add smn and sms to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501393 (https://phabricator.wikimedia.org/T220118) (owner: 10Jon Harald Søby) [23:34:37] Jhs: ok, your change is staged on mwdebug1002 [23:35:30] bd808, it works :) [23:35:57] sweet! I don't see any errors so let's ship it [23:38:54] !log bd808@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:501393|Add smn and sms to wmgExtraLanguageNames]] (T220118) (duration: 01m 02s) [23:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:58] T220118: Add smn and sms to wmgExtraLanguageNames - https://phabricator.wikimedia.org/T220118 [23:39:12] Jhs: ^ should be live everywhere now [23:40:54] bd808: are you ready for your patch? [23:41:01] bd808: yes, I'm here [23:41:25] bd808: cool. Waiting on jerkins now [23:42:13] bd808, lol.
and thank you very much :) [23:43:03] (03PS1) 10Bstorm: labstore: Adapt nfs-exportd to be used on more than one cluster [puppet] - 10https://gerrit.wikimedia.org/r/501451 (https://phabricator.wikimedia.org/T209527) [23:52:44] !log bd808@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/LdapAuthentication: SWAT: [[gerrit:501412|Also set an LDAP password policy on Block]] (T168692) (duration: 01m 01s) [23:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:48] T168692: Blocking an account on wikitech should disable LDAP logins - https://phabricator.wikimedia.org/T168692 [23:54:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:55:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:55:36] (03PS4) 10BryanDavis: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) [23:55:46] (03CR) 10BryanDavis: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [23:57:16] (03Merged) 10jenkins-bot: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [23:57:32] (03PS7) 10BryanDavis: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) [23:57:41] (03CR) 10BryanDavis: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 
(https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis) [23:58:44] (03Merged) 10jenkins-bot: wikitech: Disable Phabricator accounts when blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501123 (https://phabricator.wikimedia.org/T218654) (owner: 10BryanDavis)