[00:00:05] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T0000). [00:00:41] twentyafterfour: 👋 [00:02:09] ACKNOWLEDGEMENT - Check the Netbox report-s- -puppetdb- for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL Cas Rusnov pushing this forward for morning crew to see. - The acknowledgement expires at: 2019-05-31 06:00:42. https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:03:57] (03PS2) 10Dzahn: phabricator: remove role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513241 [00:04:43] (03PS3) 10Dzahn: phabricator: remove role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513241 [00:06:15] (03PS4) 10Dzahn: phabricator: remove role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513241 [00:08:32] (03PS5) 10Dzahn: phabricator: remove role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/513241 [00:12:03] 10Operations: reinstall RT server with private IP and stretch - https://phabricator.wikimedia.org/T180641 (10Dzahn) [00:12:05] 10Operations: Migrate ununpentium/RT to Stretch/Buster - https://phabricator.wikimedia.org/T224575 (10Dzahn) [00:12:48] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [00:13:55] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16804/" [puppet] - 10https://gerrit.wikimedia.org/r/513241 (owner: 10Dzahn) [00:17:37] !log rsyncing /srv/repos again. 
pulling on phab2001 from phab1003 (T221389) [00:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:42] T221389: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 [00:24:56] !log re-enabling puppet on phab1001 now that it does not have the phab role anymore (T221389) [00:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:01] T221389: setup/install WMF7426 as phab1003.eqiad.wmnet - https://phabricator.wikimedia.org/T221389 [00:31:34] (03CR) 10Dzahn: [C: 03+2] Fix passing ssh key on releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [00:31:43] (03PS2) 10Dzahn: Fix passing ssh key on releases hosts [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [00:34:54] 10Operations, 10DBA: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Marostegui) data consistency checks have finished for the main tables and it is all fine. This host can get repooled. 
[00:34:59] (03CR) 10Dzahn: "Notice: /Stage[main]/Jenkins::Slave/Ssh::Userkey[jenkins-slave]/File[/etc/ssh/userkeys/jenkins-slave]/content: content changed '" [puppet] - 10https://gerrit.wikimedia.org/r/513034 (owner: 10Hashar) [00:35:52] (03PS1) 10Marostegui: db2091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/513246 (https://phabricator.wikimedia.org/T224393) [00:36:36] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513247 [00:36:40] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513247 [00:37:14] (03CR) 10Marostegui: [C: 03+2] db2091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/513246 (https://phabricator.wikimedia.org/T224393) (owner: 10Marostegui) [00:38:00] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513247 (owner: 10Marostegui) [00:38:50] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513247 (owner: 10Marostegui) [00:40:48] PROBLEM - Check the Netbox report-s- coherence for fail status. on netmon1002 is CRITICAL: coherence.Coherence CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:40:54] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2091 - T224393 (duration: 00m 56s) [00:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:59] T224393: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 [00:41:08] 10Operations, 10DBA, 10Patch-For-Review: db2091 rebooted unexpectedly - https://phabricator.wikimedia.org/T224393 (10Marostegui) 05Open→03Resolved This host has been repooled [00:44:42] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [00:55:27] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab2001.codfw.wmnet'] ` Of which those *... [02:39:07] (03PS2) 10Andrew Bogott: openstack: keystone: install mysql client binary for cleanup operations [puppet] - 10https://gerrit.wikimedia.org/r/513180 (https://phabricator.wikimedia.org/T224610) (owner: 10Arturo Borrero Gonzalez) [02:39:59] (03CR) 10Andrew Bogott: [C: 03+2] openstack: keystone: install mysql client binary for cleanup operations [puppet] - 10https://gerrit.wikimedia.org/r/513180 (https://phabricator.wikimedia.org/T224610) (owner: 10Arturo Borrero Gonzalez) [02:43:26] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 11 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/srv/phab/phabricator//support/aphlict/server/node_modules],File[/srv/phab/phabricator/support/preamble.php],File[/srv/phab/phabricator/support/redirect_config.json] [03:29:57] 10Operations, 10ExternalGuidance, 10Traffic, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10santhosh) We don't have any other blockers for this puppet patch. Once this puppet patch is deployed, we have a configuration change https... 
[04:18:18] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [04:19:34] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:21:30] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Votisky) a:03Votisky [05:34:38] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) a:05Votisky→03None [05:35:54] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Votisky) 05Open→03Resolved a:03Votisky [05:38:23] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) 05Resolved→03Open a:05Votisky→03None [05:39:42] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10mobrovac) [06:21:38] PROBLEM - puppet last run on labweb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[python-keystone] [06:24:59] (03PS2) 10Elukey: Introduce profile::analytics::search::jobs [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) [06:29:14] PROBLEM - puppet last run on elastic1033 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:30:57] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Joe) For the record, the above script works once you do ` sudo apt-get install php-apcu php-bcmath php-bz2 php-cli php-common php-curl php... [06:32:36] PROBLEM - puppet last run on db2115 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:33:16] PROBLEM - puppet last run on cp5011 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [06:34:55] (03PS3) 10Elukey: Introduce profile::analytics::search::jobs [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) [06:48:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/16806/stat1007.eqiad.wmnet/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [06:48:36] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:49:10] 10Operations, 10netops: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) New error! ` /etc/cron.daily/logrotate: error: error setting owner of /var/log/librenms/daily.log to uid 498 and gid 0: Operation not permitted run-parts: /etc/cron.daily/logrotate exited... 
[06:49:30] (03PS1) 10Elukey: librenms: update logrotate to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/513258 (https://phabricator.wikimedia.org/T224502) [06:50:11] (03CR) 10Elukey: [C: 03+2] librenms: update logrotate to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/513258 (https://phabricator.wikimedia.org/T224502) (owner: 10Elukey) [06:51:09] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Joe) Oh I forgot to add: the list of unloadable extensions could be found in the php7.2-fpm log. You also need to restart php7.2-fpm for i... [06:56:16] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:59:38] RECOVERY - puppet last run on db2115 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:18] RECOVERY - puppet last run on cp5011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:21:10] !log mobrovac@deploy1001 Started deploy [restbase/deploy@92591a7]: Switch to OpenAPI v3 and drop page/html/title/revision/tid - T218218 T215956 [07:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:18] T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 [07:21:20] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [07:31:12] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 404 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 404 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is C [07:31:12] 
hoid - check test formula returned the unexpected status 404 (expecting: 200): /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 404 (expecting: 200): /transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 404 [07:31:12] /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 404 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 404 (expecting: 200): /page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-s [07:31:12] t page on enwiki returned the unexpected status 404 (expecting: 200): /feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [07:32:09] <_joe_> mobrovac: ^^ [07:32:54] <_joe_> oh right we didn't update service-checker on icinga :D [07:32:58] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 404 (expecting: 200): /page/html/{title} (Get html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 404 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is C [07:32:58] hoid - check test formula returned the unexpected status 404 (expecting: 200): /feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 returned the unexpected status 404 (expecting: 200): 
/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 404 [07:32:58] /page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 404 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 404 (expecting: 200): /page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) is CRITICAL: Test Get mobile-s [07:32:58] t page on enwiki returned the unexpected status 404 (expecting: 200): /feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [07:33:04] <_joe_> on it! [07:33:47] <_joe_> !log upgraded service-checker on icinga1001,2 [07:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:02] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:34:08] <_joe_> :) [07:34:22] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [07:38:46] oooh right _joe_ [07:38:50] thank you [07:40:27] <_joe_> what gave this away was [07:40:34] <_joe_> the servers weren't alarming [07:40:37] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@92591a7]: Switch to OpenAPI v3 and drop page/html/title/revision/tid - T218218 T215956 (duration: 19m 28s) [07:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:44] T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 [07:40:45] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [07:48:57] 
10Operations, 10RESTBase-API, 10TechCom, 10serviceops, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10mobrovac) [07:49:01] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) 05Open→03Resolved This has now been deployed. [07:50:03] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [07:51:42] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [07:56:39] (03PS1) 10Elukey: librenms: do not logrotate if logfile is empty [puppet] - 10https://gerrit.wikimedia.org/r/513261 (https://phabricator.wikimedia.org/T224502) [07:57:49] (03CR) 10Elukey: [C: 03+2] librenms: do not logrotate if logfile is empty [puppet] - 10https://gerrit.wikimedia.org/r/513261 (https://phabricator.wikimedia.org/T224502) (owner: 10Elukey) [08:09:24] PROBLEM - Nginx local proxy to apache on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 729 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:09:26] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 729 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:09:46] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 729 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:09:46] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 
header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 729 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:10:24] <_joe_> mwdebug is me [08:10:41] !log drop old Parsoid tables from cassandra -- T223998 [08:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:46] T223998: Remove old (a.k.a NG) Parsoid tables - https://phabricator.wikimedia.org/T223998 [08:11:18] (03PS4) 10Jcrespo: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 [08:13:40] RECOVERY - Nginx local proxy to apache on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.059 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:13:40] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.042 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:14:00] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76287 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:14:01] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 76334 bytes in 0.223 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:23:18] (03CR) 10Ottomata: [C: 03+1] Introduce profile::analytics::search::jobs [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [08:23:49] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 (owner: 10Jcrespo) [08:24:41] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 (owner: 10Jcrespo) [08:25:40] (03PS1) 10Mobrovac: RESTBase: Remove restbase10(0[7-9]|1[0-5]) and set them as 
spares [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) [08:32:50] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2087 for maintenance (duration: 01m 00s) [08:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:11] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1089 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513152 (owner: 10Jcrespo) [08:39:01] (03Merged) 10jenkins-bot: mariadb: Depool db1089 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513152 (owner: 10Jcrespo) [08:44:01] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 for maintenance (duration: 00m 57s) [08:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] !log maps2001 postgres initialization - T224395 [08:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:18] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [08:54:19] !log stop and restart db1089 for upgrade [08:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:46] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [09:05:11] (03PS1) 10Jcrespo: mariadb: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513264 [09:11:52] (03PS1) 10Giuseppe Lavagetto: mediawiki::web: fix position of template [puppet] - 10https://gerrit.wikimedia.org/r/513265 [09:11:54] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: add the ability to properly manage 404s [puppet] - 10https://gerrit.wikimedia.org/r/513266 [09:12:45] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web: fix position of template [puppet] - 10https://gerrit.wikimedia.org/r/513265 (owner: 10Giuseppe Lavagetto) [09:12:51] (03CR) 10jerkins-bot: [V: 04-1] 
mediawiki::web::vhost: add the ability to properly manage 404s [puppet] - 10https://gerrit.wikimedia.org/r/513266 (owner: 10Giuseppe Lavagetto) [09:12:56] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513264 (owner: 10Jcrespo) [09:13:46] (03Merged) 10jenkins-bot: mariadb: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513264 (owner: 10Jcrespo) [09:26:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1089 with low weight (duration: 00m 55s) [09:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:34] <_joe_> !log depooling mw1261 for benchmarking for T224491 [09:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:39] T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 [09:32:37] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [09:32:37] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:31] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [09:36:02] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [09:37:40] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [09:42:23] !log stop and upgrade db1102 [09:42:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:16] (03CR) 10Jcrespo: "Only waiting for buffer pool load." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 (owner: 10Jcrespo) [10:03:52] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:04:34] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [10:07:37] !log upgrade and restart test-s4 hosts (db1111, db1112) [10:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:13] 10Operations, 10User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (10Aklapper) (For me it's up to what works best for SRE. Just tell me how / if I can help to get this task closer to `resolved` status... :) If my previous understanding is wrong and there i... [10:14:59] (03CR) 10Volans: "Looks good in general, I trust that the long postgres command was already reviewed/tested. I have only one question/request about safety v" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [10:15:45] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:15:45] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:14] (03CR) 10Jbond: "> Thanks! That's a good idea. I think a static route on at least one of the router would be even better. 
Let's say we push a change that m" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [10:17:02] PROBLEM - Host ms-be1050 is DOWN: PING CRITICAL - Packet loss = 100% [10:18:15] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:18:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:30] RECOVERY - Host ms-be1050 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [10:19:11] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:19:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:27] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:22:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:22:30] jouncebot, refresh [10:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:31] I refreshed my knowledge about deployments. 
[10:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:25:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:53] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:28:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:45] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:32:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:13] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:36:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:36:14] PROBLEM - Apache HTTP on mw1347 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:40] RECOVERY - Apache HTTP on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:39:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:39:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:39:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:58] !log upgrade and restart db1117 (temporary proxy fail for passive host, reduced redundancy for m*) [10:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:46] (03PS10) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [10:50:01] PROBLEM - Host ms-be1043 is DOWN: PING CRITICAL - Packet loss = 100% [10:53:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "let's go." [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [10:55:37] jouncebot, next [10:55:37] In 0 hour(s) and 4 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1100) [10:59:29] damn, _joe_ should we wait for the swat to finish? [10:59:46] <_joe_> no. [10:59:59] <_joe_> just disable puppet everywhere as usual, verify it's a noop [11:00:04] Amir1 and Lucas_WMDE: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1100). [11:00:04] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:08] !log Disable puppet on mw* servers to merge 507939 - T219150 [11:00:10] swat should be pretty fast, just one patch to deploy [11:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:13] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [11:00:14] I can SWAT today! 
[11:00:23] o/ [11:00:31] Urbanecm: I'm around, in case you need me [11:00:41] thanks zeljkof [11:00:59] (03PS2) 10Urbanecm: Enable abusefilter blocking ability in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513186 (https://phabricator.wikimedia.org/T224617) [11:01:13] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513186 (https://phabricator.wikimedia.org/T224617) (owner: 10Urbanecm) [11:02:07] (03Merged) 10jenkins-bot: Enable abusefilter blocking ability in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513186 (https://phabricator.wikimedia.org/T224617) (owner: 10Urbanecm) [11:03:05] Skipping mwdebug, since it's untestable [11:04:34] !log urbanecm@deploy1001 Synchronized wmf-config/abusefilter.php: [[:gerrit:Enable abusefilter blocking ability in plwiki]] (T224617) (duration: 00m 58s) [11:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:39] T224617: Enable abusefilter blocking ability in plwiki - https://phabricator.wikimedia.org/T224617 [11:05:11] !log EU SWAT finished [11:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:48] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:06:31] RECOVERY - Host ms-be1043 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:07:01] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 896 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:07:28] <_joe_> uh another case [11:07:36] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:07:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:59] RECOVERY - puppet last run on ms-be1043 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:09:15] (03PS11) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [11:09:50] 10Operations: cron-spam to root@: lsof stderr generates large emails on boron from wmf-auto-restart - https://phabricator.wikimedia.org/T224661 (10Volans) [11:09:58] 10Operations: cron-spam to root@: lsof stderr generates large emails on boron from wmf-auto-restart - https://phabricator.wikimedia.org/T224661 (10Volans) p:05Triage→03High [11:10:21] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [11:10:29] (03PS1) 10Volans: base: ignore lsof stderr in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/513270 (https://phabricator.wikimedia.org/T224661) [11:10:32] jbond42: if you have a sec ^^^ [11:10:43] looking [11:11:10] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:11:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:24] thx [11:12:56] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513270 (https://phabricator.wikimedia.org/T224661) (owner: 10Volans) [11:13:00] np [11:13:29] jijiki: can I merge something in puppet or is a delicate time? 
[11:14:03] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 76402 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:14:04] (03PS2) 10Volans: base: ignore lsof stderr in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/513270 (https://phabricator.wikimedia.org/T224661) [11:14:04] <_joe_> !log freed opcache on mw1281 [11:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:13] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:14:14] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:39] (03CR) 10Volans: [C: 03+2] base: ignore lsof stderr in wmf-auto-restart [puppet] - 10https://gerrit.wikimedia.org/r/513270 (https://phabricator.wikimedia.org/T224661) (owner: 10Volans) [11:18:55] 10Operations: cron-spam to root@: lsof stderr generates large emails on boron from wmf-auto-restart - https://phabricator.wikimedia.org/T224661 (10Volans) p:05High→03Normal Bandaid applied, no output from running `wmf-auto-restart` on `boron`. It should be good for now. Leaving open for a better long-term so... 
[11:19:13] thanks for squashing the cronspam [11:19:57] yw :) [11:20:07] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:20:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:39] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:24:39] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:49] (03PS4) 10Elukey: Introduce profile::analytics::search::jobs [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) [11:28:31] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:28:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:50] (03CR) 10Elukey: [C: 03+2] Introduce profile::analytics::search::jobs [puppet] - 10https://gerrit.wikimedia.org/r/513038 (https://phabricator.wikimedia.org/T224200) (owner: 10Elukey) [11:31:24] (03PS1) 10Effie Mouzeli: mediawiki: add newline in mediawiki-vhost.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/513271 [11:34:28] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [11:34:37] !log reboot ganeti2003 for kernel upgrades [11:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:57] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:35:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:36:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:18] PROBLEM - Host ganeti2003 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:36] RECOVERY - Host ganeti2003 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [11:36:50] (03PS2) 10Effie Mouzeli: mediawiki: add newline in mediawiki-vhost.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/513271 [11:39:40] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:39:41] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:10] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync [11:41:40] this seems expected due to the ganeti reboot [11:42:22] yep, ConnectTimeoutError [11:44:11] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:44:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:48] (03PS1) 10Arturo Borrero Gonzalez: sssd: include the /etc/ldap.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/513272 (https://phabricator.wikimedia.org/T224558) [11:49:32] (03PS3) 10Effie Mouzeli: mediawiki: add newline in mediawiki-vhost.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/513271 [11:50:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/16810/" [puppet] - 10https://gerrit.wikimedia.org/r/513272 (https://phabricator.wikimedia.org/T224558) (owner: 10Arturo Borrero Gonzalez) [11:50:37] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: add newline 
in mediawiki-vhost.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/513271 (owner: 10Effie Mouzeli) [11:50:52] (03PS4) 10Effie Mouzeli: mediawiki: add newline in mediawiki-vhost.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/513271 [11:51:10] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync [11:52:42] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:52:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:10] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:57:10] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1200) [12:02:24] (03PS2) 10Jbond: firewall logging: add firewall logging to kafak servers [puppet] - 10https://gerrit.wikimedia.org/r/511705 (https://phabricator.wikimedia.org/T116011) [12:02:55] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:02:56] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:23] (03CR) 10Jbond: [C: 03+2] firewall logging: add firewall logging to kafak servers [puppet] - 10https://gerrit.wikimedia.org/r/511705 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [12:07:49] (03PS4) 10Jbond: admin module: improve CI [puppet] - 10https://gerrit.wikimedia.org/r/510871 
[12:08:05] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10jijiki) I downtimed the host on icinga for another week [12:08:11] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:08:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:08] (03PS6) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) [12:11:19] 10Operations, 10serviceops, 10User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (10elukey) The best guess that I can give after checking the HHVM memcached extension's code is that the following might be the rea... 
[12:12:34] (03CR) 10Jbond: [C: 03+2] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:15:01] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:15:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:18] (03PS1) 10Effie Mouzeli: hieradata: enable php72_only on mw2135 [puppet] - 10https://gerrit.wikimedia.org/r/513273 (https://phabricator.wikimedia.org/T219150) [12:17:50] (03PS5) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [12:20:02] (03CR) 10Jbond: [C: 03+2] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:21:02] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:21:03] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:44] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:25:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:58] (03PS2) 10Effie Mouzeli: hieradata: enable php72_only on mw2135 [puppet] - 10https://gerrit.wikimedia.org/r/513273 (https://phabricator.wikimedia.org/T219150) [12:37:21] !log jbond@cumin1001 START - Cookbook 
sre.hosts.downtime [12:37:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:24] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:42:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:42:27] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/16812/mw2135.codfw.wmnet/ https://puppet-compiler.wmflabs.org/compiler100" [puppet] - 10https://gerrit.wikimedia.org/r/513273 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [12:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:54] (03PS3) 10Effie Mouzeli: hieradata: enable php72_only on mw2135 [puppet] - 10https://gerrit.wikimedia.org/r/513273 (https://phabricator.wikimedia.org/T219150) [12:44:19] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable php72_only on mw2135 [puppet] - 10https://gerrit.wikimedia.org/r/513273 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [12:44:21] jouncebot: now [12:44:21] For the next 0 hour(s) and 15 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1200) [12:44:23] jouncebot: next [12:44:24] In 0 hour(s) and 15 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1300) [12:47:10] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [12:49:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [12:49:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:49:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:31] (03PS1) 10Andrew Bogott: glance image sync timer: don't monitor if it's disabled [puppet] - 10https://gerrit.wikimedia.org/r/513276 [12:51:34] (03PS1) 10Effie Mouzeli: hieradata: enable php72_only on mw2135 [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) [12:57:59] <_joe_> jijiki: the commit message is wrong :D [12:58:11] zeljkof: hi. is the train going to be okay? can you merge that patch before rolling it out? [12:58:23] I’m on my phone right now [12:58:27] (03PS2) 10Effie Mouzeli: hieradata: enable php72_only on mw1348 [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) [12:58:34] _joe_: I know [12:58:39] I read it as well :p [12:58:44] MatmaRex_: I've just +2d the patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/513237 [12:58:45] MatmaRex_: Which patch? [12:58:46] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.7/includes/Linker.php: T222628 (duration: 01m 04s) [12:58:58] so I can continue with the train [12:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:00] T222628: Some history views and diffs unavailable on Wikipedias (Fatal ParameterAssertionException: Bad value for parameter $dbkey) - https://phabricator.wikimedia.org/T222628 [12:59:16] by the way, Phabricator seems to have stopped accepting email replies to tasks [12:59:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513278 (https://phabricator.wikimedia.org/T221502) [12:59:34] zeljkof: I'm just finishing deploying that patch to prevent a full deployment of something that was fixed in .6 but not in master [12:59:50] zeljkof: thanks! Reedy: that one :) [13:00:04] zeljkof: It is that lovely time of the day again! 
You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1300). [13:00:16] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.7/tests/phpunit/includes/: T222628 (duration: 01m 06s) [13:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:28] Reedy: which patch are you deploying? [13:00:32] * zeljkof is confused [13:00:41] A different one [13:00:42] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/513275/1 [13:00:43] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513278 (https://phabricator.wikimedia.org/T221502) (owner: 10Marostegui) [13:00:57] (03PS1) 10Vgutierrez: redirects.dat: Provide support for nginx in compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) [13:00:59] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/16813/" [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [13:01:18] zeljkof: I need to deploy ^ for an onsite maintenance [13:01:21] it will be quick :) [13:01:31] marostegui: ok [13:01:36] Reedy: ok [13:01:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513278 (https://phabricator.wikimedia.org/T221502) (owner: 10Marostegui) [13:01:48] (03CR) 10Vgutierrez: [C: 03+2] redirects.dat: Get rid of Apache specific variables [puppet] - 10https://gerrit.wikimedia.org/r/513077 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:01:50] (03PS2) 10Vgutierrez: redirects.dat: Get rid of Apache specific variables [puppet] - 10https://gerrit.wikimedia.org/r/513077 (https://phabricator.wikimedia.org/T224539) [13:01:55] I'm done though [13:01:56] (03CR) 10jerkins-bot: [V: 04-1] redirects.dat: Provide support for nginx in 
compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:02:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099 T221502 (duration: 00m 56s) [13:03:01] zeljkof: I am done! [13:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:03] T221502: db1099 memory issues - https://phabricator.wikimedia.org/T221502 [13:03:20] !log Stop MySQL on db1099 for onsite maintenance - T221502 [13:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:36] (03PS3) 10Effie Mouzeli: hieradata: enable php72_only on mw1348 [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) [13:04:07] <_joe_> jijiki: puppet is still disabled everywhere? [13:04:57] marostegui, Reedy: cool [13:05:26] _joe_: yep [13:05:33] apart from 2 hosts in codfw [13:06:08] why? [13:06:21] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) 05Resolved→03Open re-opening as this is going to be worked out. MySQL is stopped on s1 and s8, host downtimed and OS upgraded. It can be taken by @Cmjohnson anytime. Ple... [13:06:46] (03PS2) 10Cmjohnson: Adding mgmt dns for single cpu spare servers [dns] - 10https://gerrit.wikimedia.org/r/510745 (https://phabricator.wikimedia.org/T219890) [13:07:32] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for single cpu spare servers [dns] - 10https://gerrit.wikimedia.org/r/510745 (https://phabricator.wikimedia.org/T219890) (owner: 10Cmjohnson) [13:08:17] !log cp3047 puppet-disable + depool for reimage to ATS - T222937 [13:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:22] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [13:08:49] _joe_: why ? 
[13:09:24] (03PS2) 10Vgutierrez: redirects.dat: Get rid of www.*.wikipedia.[com,net,info] [puppet] - 10https://gerrit.wikimedia.org/r/513141 (https://phabricator.wikimedia.org/T224539) [13:09:26] (03PS2) 10Vgutierrez: redirects.dat: Ban using .*. [puppet] - 10https://gerrit.wikimedia.org/r/513142 (https://phabricator.wikimedia.org/T133548) [13:09:28] (03PS2) 10Vgutierrez: redirects.dat: Provide support for nginx in compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) [13:10:22] (03PS1) 10BBlack: cache: reimage cp3047 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513280 (https://phabricator.wikimedia.org/T222937) [13:10:42] (03CR) 10jerkins-bot: [V: 04-1] redirects.dat: Provide support for nginx in compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) (owner: 10Vgutierrez) [13:11:11] (03CR) 10BBlack: [C: 03+2] cache: reimage cp3047 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513280 (https://phabricator.wikimedia.org/T222937) (owner: 10BBlack) [13:11:36] <_joe_> because I thought you would reenable it. it's ok but tell me when you do. [13:12:22] I will, after I ensure that mw1348 will serve properly [13:12:58] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable php72_only on mw1348 [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [13:13:14] (03PS4) 10Effie Mouzeli: hieradata: enable php72_only on mw1348 [puppet] - 10https://gerrit.wikimedia.org/r/513277 (https://phabricator.wikimedia.org/T219150) [13:13:23] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3047.esams.wmnet'] ` The log can be found i... 
[13:14:34] (03PS3) 10Jcrespo: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [13:14:36] (03PS1) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) [13:15:16] (03CR) 10Marostegui: mariadb: Depool es1019 for maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:16:17] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1007.eqiad.wmnet [13:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:27] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1008.eqiad.wmnet [13:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:40] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1009.eqiad.wmnet [13:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:55] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1010.eqiad.wmnet [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1011.eqiad.wmnet [13:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:30] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1012.eqiad.wmnet [13:17:35] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1013.eqiad.wmnet [13:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:42] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1014.eqiad.wmnet [13:17:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:47] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1015.eqiad.wmnet [13:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:55] (03CR) 10Jcrespo: mariadb: Depool es1019 for maintenance (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:19:48] (03PS4) 10Jcrespo: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [13:19:50] (03PS2) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) [13:21:14] (03CR) 10Marostegui: [C: 03+1] mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [13:22:00] !log zfilipin@deploy1001 Synchronized php-1.34.0-wmf.7/resources/src/jquery/jquery.suggestions.js: SWAT: [[gerrit:513237|jquery.suggestions: Do not show suggestions on prefilled values ([T224524])]] (duration: 00m 58s) [13:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:06] T224524: Various pages (Special:Contributions, Special:WhatLinksHere) shows autocomplete / search suggestions dropdown on page load - https://phabricator.wikimedia.org/T224524 [13:22:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/compiler1001/16814/" [puppet] - 10https://gerrit.wikimedia.org/r/513262 (https://phabricator.wikimedia.org/T223976) (owner: 10Mobrovac) [13:23:12] (03PS5) 10Jcrespo: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [13:23:19] (03PS3) 10Vgutierrez: redirects.dat: Provide support for 
nginx in compile_redirects() [puppet] - 10https://gerrit.wikimedia.org/r/513279 (https://phabricator.wikimedia.org/T224539) [13:24:24] (03PS3) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) [13:26:57] zeljkof: I need to deploy 2 mediawiki-config patches to depool servers for hw maintenance [13:27:21] jynus: I'm in the middle of train, can it wait? [13:27:40] cmjohnson1 ^ [13:28:03] !log Enabling puppet on mw*.codfw.net [13:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:26] _joe_: I am enabling puppet on mw* [13:28:28] (03PS1) 10Zfilipin: all wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513284 [13:28:30] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513284 (owner: 10Zfilipin) [13:28:48] <_joe_> cool [13:29:22] <_joe_> zeljkof: can you let me and jijiki know before you actually run scap? [13:29:28] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513284 (owner: 10Zfilipin) [13:30:41] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Joe) after even the first simplest tests, it's absolutely clear to me that running with... [13:30:42] _joe_, jijiki: I'm running it now o.O [13:30:42] zeljkof: that would be great actually [13:30:45] oh [13:30:48] ok [13:30:51] it's train window? [13:31:04] I thought I don't have to make it explicit [13:31:21] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1300 [13:31:33] yeah but we have been pushing some php7 changes [13:32:01] uh oh [13:32:19] well, can you do it outside train window? 
[13:32:29] if you have started already we should wait [13:32:47] <_joe_> zeljkof: no specifically we wanted to watch what happened during the train deploy with php7 and opcache [13:32:59] ah [13:33:15] <_joe_> it was a request for coordination, but I guess it was asking way too much. [13:33:16] ok, you should have let me know, so I'm not as scared as I am now :) [13:33:31] <_joe_> I said "let me know" [13:33:40] <_joe_> not OMG DON'T :) [13:33:44] ok, so I'll continue with the train, scap was paused while https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/513284 was being merged [13:34:22] _joe_: it's cool, I was just surprised, I thought you were deploying something while I was running the train [13:34:29] <_joe_> yes, it's enough to know "scap is running" [13:34:30] I guess it's just a misunderstanding [13:34:37] ok, so scap is running, now [13:34:46] <_joe_> cool, thanks! [13:35:37] I've seen this before: [13:35:41] `13:35:14 Check 'Check endpoints for mwdebug1002.eqiad.wmnet' failed: /wiki/{title} (Special Version) timed out before a response was received` [13:36:08] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.7 [13:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:51] _joe_, jijiki: scap is done [13:37:00] ok [13:37:03] no problems, only that one warning [13:37:43] <_joe_> zeljkof: yeah that's just a common failure on mwdebug1002 [13:37:47] <_joe_> it will be better with php7 [13:38:28] <_joe_> jijiki: let's look at how is mw1348 faring compared to before [13:38:34] PROBLEM - HHVM rendering on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:39:32] RECOVERY - HHVM rendering on mw1347 is OK: HTTP OK: HTTP/1.1 200 OK - 76170 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:40:03] <_joe_> jijiki: ok max children reached on that api server, but I 
guess we expected it [13:40:40] I think as well, I will keep monitoring anyway [13:40:54] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:40:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:40:57] <_joe_> btw [13:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:07] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?panelId=62&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1348&from=now-3h&to=now [13:41:53] yes it is doing get instead of gets [13:42:12] we had a discussion about that [13:42:24] <_joe_> well this is proof. [13:42:31] hehe [13:43:10] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:43:50] PROBLEM - HHVM rendering on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:44:12] (03PS6) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) [13:44:18] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:45:12] <_joe_> uhm [13:45:16] <_joe_> what's going on? [13:45:43] 10Operations, 10serviceops, 10User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (10Joe) 05Open→03Resolved we transitioned our first appserver (api) to full php7 and it confirms indeed the theory https://graf... 
[13:46:14] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 76172 bytes in 2.776 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:46:44] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.041 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:46:50] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:47:46] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:49:01] hmm [13:50:27] _joe_: I am enabling puppet in eqiad and I am depooling mw1348 [13:50:37] there are some errors I would like to take a look [13:50:56] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:40] <_joe_> what do you mean depooling? [13:51:51] <_joe_> is the error rate higher than normal? [13:51:54] I will depool it [13:52:12] <_joe_> can I now which errors? [13:53:18] give me a sec [13:53:31] <_joe_> sure [13:55:13] <_joe_> I literally see no 500 from that server [13:55:41] ok thefe are a few Wikimedia\Rdbms\LBFactory::commitAndWaitForReplication: LinksUpdate::incrTableUpdate does not have outer scope. [13:55:44] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [13:55:59] but I doubt it is specific to that server [13:56:07] <_joe_> and those are just from that server? no. 
[13:56:16] <_joe_> breathe :D [13:56:32] no they are allover [13:57:15] !log enable puppet on mw* in eqiad [13:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:27] <_joe_> those errors were spread across hhvm and php [13:57:37] (03PS1) 10BBlack: install_console: explicitly use IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/513290 [13:59:10] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [14:00:27] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:00:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:56] (03PS2) 10Jbond: flake8 - pcc: add extension so file can be checked [puppet] - 10https://gerrit.wikimedia.org/r/510489 (https://phabricator.wikimedia.org/T144169) [14:02:30] (03CR) 10Jbond: [C: 03+2] flake8 - pcc: add extension so file can be checked [puppet] - 10https://gerrit.wikimedia.org/r/510489 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [14:05:25] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:05:25] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:31] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513290 (owner: 10BBlack) [14:07:18] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/513290 (owner: 10BBlack) [14:14:22] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3047.esams.wmnet'] ` and were **ALL** successful. 
[14:15:32] (03PS3) 10Andrew Bogott: wmfkeystonehooks: create security groups via the neutron API [puppet] - 10https://gerrit.wikimedia.org/r/511461 [14:16:13] (03PS1) 10Jhedden: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) [14:16:17] (03PS1) 10Jhedden: admins: Add jeh to ops group [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) [14:18:06] 10Operations, 10serviceops, 10User-jijiki: Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster - https://phabricator.wikimedia.org/T223647 (10jijiki) {F29279472} :D [14:21:00] (03PS2) 10Jhedden: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) [14:21:02] (03PS2) 10Jhedden: admins: Add jeh to ops group [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) [14:21:19] jijiki: _joe_ zeljkof what is the deployment status? finished? things stable? [14:21:31] <_joe_> yes [14:21:33] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:21:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:40] yes to stable, I guess? [14:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:52] what about finished? 
for zeljkof [14:22:18] !log rebooting cp3047 (post-reimage/puppetization for T222937) [14:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:23] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [14:24:27] (03CR) 10BBlack: [C: 03+2] install_console: explicitly use IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/513290 (owner: 10BBlack) [14:24:35] (03PS2) 10BBlack: install_console: explicitly use IPv4 only [puppet] - 10https://gerrit.wikimedia.org/r/513290 [14:25:31] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:25:31] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:25:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:44] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:29:44] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:11] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [14:36:27] (03PS3) 10Jhedden: admins: Add jeh to production shell users [puppet] - 10https://gerrit.wikimedia.org/r/513292 (https://phabricator.wikimedia.org/T224627) [14:36:29] (03PS3) 10Jhedden: admins: Add jeh to ops group [puppet] - 10https://gerrit.wikimedia.org/r/513293 (https://phabricator.wikimedia.org/T224627) [14:36:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate 
partition or disk - https://phabricator.wikimedia.org/T207707 (10Cmjohnson) [14:37:04] 10Operations, 10cloud-services-team: Reimage cloudvirtan* to Stretch - https://phabricator.wikimedia.org/T224566 (10Andrew) It's most likely that these won't wind up used as virt hosts at all. I'l'l reimage them in the meantime just in case -- it's not much work while they're empty. [14:38:27] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Cmjohnson) @greg @robh I am just plugging these disks into the server correct? nothing else? this will not r... [14:38:50] (03PS1) 10Andrew Bogott: cloudvirtan: move all hosts to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513295 (https://phabricator.wikimedia.org/T224566) [14:40:05] (03PS2) 10Andrew Bogott: cloudvirtan: move all hosts to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513295 (https://phabricator.wikimedia.org/T224566) [14:41:09] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirtan: move all hosts to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513295 (https://phabricator.wikimedia.org/T224566) (owner: 10Andrew Bogott) [14:42:37] !log reimaging cloudvirtan1001 [14:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:28] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:43:28] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:19] !log reimaging cloudvirtan1001 for T224566 [14:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:24] T224566: Reimage cloudvirtan* to Stretch - https://phabricator.wikimedia.org/T224566 [14:49:54] (03PS6) 10Jcrespo: Revert "mariadb: Depool db1089 for 
maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 [14:53:35] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 (owner: 10Jcrespo) [14:54:19] jynus: is it you working on T202367? dbproxy2004 was reported above by the netbox report because has state PLANNED still [14:54:20] T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] - https://phabricator.wikimedia.org/T202367 [14:54:29] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 (owner: 10Jcrespo) [14:55:18] volans: I don't think anyone is working at the moment on T202367 [14:55:32] it is #DBA (next) [14:56:15] (03PS1) 10Andrew Bogott: cloudvirtan: update nic labels for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513296 (https://phabricator.wikimedia.org/T224566) [14:56:29] jynus: ack I'm move it into staged like the others then [14:56:39] volans: most likely it was rebooted and autoinstalled but not provisioned [14:56:50] so not pending, but not in production [14:56:56] yes, staged [14:56:56] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirtan: update nic labels for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/513296 (https://phabricator.wikimedia.org/T224566) (owner: 10Andrew Bogott) [14:57:18] in the limbo between dcops and dbas [14:57:38] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:58:14] and here the recovery :) [14:59:37] "Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK" [14:59:57] it is almost as bad as our backup checks [15:00:12] volans: we kinda decided to not use check on checks [15:00:26] what do you mean? [15:00:36] the wording? 
[15:00:43] removing the "check" on icinga checks :-D [15:00:51] for redundancy [15:00:55] redir=301 to chaomodus: ^^^ [15:01:12] what is report-s-? [15:01:19] sure, will tell him [15:01:32] no big deal, just I didn't understood the line [15:01:38] it's icinga 'report(s)' [15:01:46] ah [15:01:51] becasue it could check multiple ones, but I think we shouildn't anyway [15:01:51] so it is the icinga thing [15:01:54] I know [15:01:56] so happy to force it singular [15:02:31] but the "s." must be a mistake then [15:03:18] (03PS4) 10Jcrespo: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) [15:04:14] 08Warning Alert for device asw-d-codfw.mgmt.codfw.wmnet - Port with no description on access switch [15:04:40] jynus: sorry, just saw your ping, finished, things looked ok [15:05:17] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [15:06:06] Yah i was under a misapprehention of what that message is for ;) [15:06:08] (03Merged) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [15:06:13] i noticed that the language on it was weird. 
[15:06:16] !log pooled maps2004 - osm import is complete - T224395 [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:22] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [15:06:49] (03PS1) 10Alexandros Kosiaris: Fix TLS support for kask and cassandra connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/513297 (https://phabricator.wikimedia.org/T220401) [15:06:51] (03PS1) 10Alexandros Kosiaris: Package and publish kask 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513298 (https://phabricator.wikimedia.org/T220401) [15:15:27] (03PS11) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) [15:18:45] (03PS12) 10Jbond: varnish: ratelimit unusual image sizes [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) [15:19:46] !log performing rolling reboots of eqiad kafka main cluster hosts for security updates [15:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:47] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1089 with full weight; depool es1019 (duration: 00m 52s) [15:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] RECOVERY - Maps - OSM synchronization lag - codfw on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 5.541e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [15:26:54] !log shutting down db1099 to swap DIMM T221502 [15:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:00] T221502: db1099 memory issues - https://phabricator.wikimedia.org/T221502 [15:27:09] 10Operations, 10serviceops, 10wikitech.wikimedia.org, 10PHP 7.2 support, 10Patch-For-Review: switch wikitech to PHP 7.2 - https://phabricator.wikimedia.org/T223393 (10Dzahn) 
>>! In T223393#5223143, @Joe wrote: > Once you've done that, I'd say let's be bold and just change the proxy/rewrite rules from ht... [15:27:27] (03PS2) 10Jbond: firewall logging: Enable logging on misc services [puppet] - 10https://gerrit.wikimedia.org/r/511706 (https://phabricator.wikimedia.org/T116011) [15:29:13] (03CR) 10Jbond: [C: 03+2] firewall logging: Enable logging on misc services [puppet] - 10https://gerrit.wikimedia.org/r/511706 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [15:33:01] (03PS3) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:33:54] PROBLEM - Host db1099.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:34:04] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:35:57] ^expected [15:36:02] ah, he logged it already [15:36:44] (03PS4) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:36:57] !log stop es1019 for maintenance T213422 [15:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] T213422: es1019 IPMI and its management interface are unresponsive (again) - https://phabricator.wikimedia.org/T213422 [15:37:57] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:39:24] RECOVERY - Host db1099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [15:42:03] (03PS1) 10Paladox: Phabricator: Fix ssh config to listen on git-ssh.w.org ipv6 address too [puppet] - 10https://gerrit.wikimedia.org/r/513303 [15:42:37] 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 
(10Cmjohnson) Swapped DIMM A5 with DIMM B5 and cleared the racadm log. [15:42:41] (03PS2) 10Paladox: Phabricator: Fix ssh config to listen on git-ssh.w.org ipv6 address too [puppet] - 10https://gerrit.wikimedia.org/r/513303 [15:43:13] (03PS3) 10Paladox: Phabricator: Fix ssh config to listen on git-ssh.w.org ipv6 address too [puppet] - 10https://gerrit.wikimedia.org/r/513303 [15:43:20] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513303 (owner: 10Paladox) [15:44:13] 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) Thanks - I will take it from here [15:45:13] (03PS5) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [15:46:01] (03PS2) 10Alexandros Kosiaris: Fix TLS support for kask and cassandra connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/513297 (https://phabricator.wikimedia.org/T220401) [15:46:03] (03PS2) 10Alexandros Kosiaris: Package and publish kask 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513298 (https://phabricator.wikimedia.org/T220401) [15:46:49] 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) a:05Cmjohnson→03Marostegui [15:47:13] 10Operations, 10ops-eqiad, 10DBA: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) mysql started and replication catching up [15:49:11] (03PS4) 10Paladox: Phabricator: Fix ssh config to listen on git-ssh.w.org ipv6 address too [puppet] - 10https://gerrit.wikimedia.org/r/513303 [15:49:24] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513303 (owner: 10Paladox) [15:49:37] (03CR) 10Jbond: "ready for review/comment" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [15:49:54] (03PS1) 10Marostegui: 
db-eqiad.php: Slowly repool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) [15:50:23] (03CR) 10Marostegui: [C: 04-2] "Replication still catching up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: 10Marostegui) [15:50:44] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Slowly repool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) (owner: 10Marostegui) [15:51:39] (03PS3) 10Alexandros Kosiaris: Fix TLS support for kask and cassandra connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/513297 (https://phabricator.wikimedia.org/T220401) [15:51:41] (03PS3) 10Alexandros Kosiaris: Package and publish kask 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513298 (https://phabricator.wikimedia.org/T220401) [15:52:12] PROBLEM - puppet last run on analytics1046 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[15:53:33] (03PS2) 10Marostegui: db-eqiad.php: Slowly repool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513304 (https://phabricator.wikimedia.org/T221502) [15:53:51] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513307 [15:54:05] (03CR) 10Jcrespo: [C: 04-2] "Not ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513307 (owner: 10Jcrespo) [15:56:53] (03PS1) 10Volans: icinga: clarify Puppet alert message [puppet] - 10https://gerrit.wikimedia.org/r/513310 [15:57:40] (03CR) 10Dzahn: [C: 03+2] Phabricator: Fix ssh config to listen on git-ssh.w.org ipv6 address too [puppet] - 10https://gerrit.wikimedia.org/r/513303 (owner: 10Paladox) [15:59:47] 10Operations, 10cloud-services-team: Reimage cloudvirtan* to Stretch - https://phabricator.wikimedia.org/T224566 (10Andrew) 05Open→03Resolved a:03Andrew [15:59:49] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [16:00:01] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Andrew) [16:00:05] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:01:46] !log phab1001 - removing git-ssh.wm.org IP from interface - phab1003 - activating IPv6 listen address for git-ssh [16:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:11] !log phab1001 - removing 10.64.32.186/32 from eth0 [16:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:22] !log phab1001 - removing 2620:0:861:103:10:64:32:186/128 from eth0 [16:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:22] RECOVERY - puppet last run on analytics1046 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:24:34] !log re-pool cp3047 into service as ats-be - T222937 [16:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:40] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 [16:27:15] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[16:41:09] jouncebot, next [16:41:09] In 0 hour(s) and 18 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1700) [16:48:30] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) Opened T224679 [16:54:50] (03Abandoned) 10Urbanecm: Add vnwikimedia to DNS [dns] - 10https://gerrit.wikimedia.org/r/467425 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [16:54:56] (03Abandoned) 10Urbanecm: Add vn.wikimedia.org to wikimedia-chapter [puppet] - 10https://gerrit.wikimedia.org/r/467525 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [16:55:16] (03Abandoned) 10Urbanecm: Initial configuration for vnwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467528 (https://phabricator.wikimedia.org/T207052) (owner: 10Urbanecm) [16:59:18] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:00:04] cscott, arlolra, subbu, and halfak: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1700). 
[17:00:12] no parsoid deploy today [17:10:56] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Some queries causes wdqs-blazegraph on wdqs1006 to crash and restart - https://phabricator.wikimedia.org/T213191 (10Smalyshev) 05Open→03Stalled p:05Normal→03Low [17:15:05] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) [17:16:20] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) a:05Papaul→03herron @herron All is done at my end what left to be done is just the OS install. Let me know if you have any questions [17:16:36] thanks papaul! [17:18:04] (03PS1) 10BBlack: cache: reimage cp3049 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/513317 (https://phabricator.wikimedia.org/T222937) [17:18:20] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 (10Papaul) @herron also after the OS install please remember to change Netbox status to "staged" [17:28:06] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) [17:33:02] 10Operations, 10Analytics: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10ayounsi) p:05Triage→03High [17:33:16] 10Operations, 10Analytics: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10ayounsi) [17:35:48] ACKNOWLEDGEMENT - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 3754 MB (2% inode=86%): Ayounsi https://phabricator.wikimedia.org/T224682 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:36:56] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:38:36] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [17:39:03] (03PS1) 10Krinkle: profiler: Remove fake "HTTP-verb" stack frame from HHVM sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513320 [17:39:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) Swapped DIMM B3 with DIMM A3 and cleared the log. [17:42:41] (03PS4) 10Alexandros Kosiaris: cassandra::single_instance: Remove thrift ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/513122 [17:42:42] (03PS1) 10Alexandros Kosiaris: Add sessionstore.discovery.wmnet TLS cert [puppet] - 10https://gerrit.wikimedia.org/r/513323 (https://phabricator.wikimedia.org/T220401) [17:43:01] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Tested by deploying on staging (but without the SSL cert for kask itself). 
Worked fine, proceeding" [deployment-charts] - 10https://gerrit.wikimedia.org/r/513297 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [17:43:08] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Package and publish kask 0.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/513298 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [17:47:30] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:49:14] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:50:38] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:53:00] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [17:55:59] (03CR) 10Krinkle: [C: 03+2] profiler: Remove fake "HTTP-verb" stack frame from HHVM sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513320 (owner: 10Krinkle) [17:56:20] * Krinkle staging on mwdebug1002 [17:56:58] ACKNOWLEDGEMENT - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 23 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/srv/phab/phabricator//support/aphlict/server/node_modules],File[/srv/phab/phabricator/support/preamble.php],File[/srv/phab/phabricator/support/redirect_config.json] daniel_zahn dz working on it [17:57:02] (03Merged) 10jenkins-bot: profiler: Remove fake "HTTP-verb" stack frame from HHVM sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513320 (owner: 10Krinkle) [18:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1800). Please do the needful. [18:00:04] kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:34] (03PS1) 10Paladox: Phabricator: Fix aphlict to not try and start service if ensure == absent [puppet] - 10https://gerrit.wikimedia.org/r/513327 [18:01:54] (03PS1) 10Alexandros Kosiaris: Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401) [18:02:15] (03CR) 10jerkins-bot: [V: 04-1] Add sessionstore LVS DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/513328 (https://phabricator.wikimedia.org/T220401) (owner: 10Alexandros Kosiaris) [18:02:34] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:02:34] ACKNOWLEDGEMENT - Check the Netbox report-s- coherence for fail status. 
on netmon1002 is CRITICAL: coherence.Coherence CRITICAL Ayounsi https://phabricator.wikimedia.org/T223450 https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:03:04] (03PS2) 10Paladox: Phabricator: Fix aphlict to not try and start service if ensure == absent [puppet] - 10https://gerrit.wikimedia.org/r/513327 [18:03:13] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/513327 (owner: 10Paladox) [18:03:33] !log krinkle@deploy1001 Synchronized wmf-config/arclamp.php: (no justification provided) (duration: 00m 53s) [18:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:44] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [18:04:05] here [18:05:01] Hi kostajh, if no one else plans to do that, I can SWAT your patch to production! [18:05:10] ooh, I was about to ask who's around [18:05:12] sounds good Urbanecm [18:06:16] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:06:18] PROBLEM - HHVM rendering on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:06:56] PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:07:21] kostajh, +2'ed your patch, waiting for CI [18:07:28] PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:07:32] cool. 
it's going to be a while :\ [18:07:58] yeah [18:08:25] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) More errors, that I presume are due to corruption due to being impossible based... [18:08:57] Urbanecm: please wait for a few minutes - investigating a possible issue [18:09:04] (fine to merge, don't stage/deploy yet) [18:09:11] Krinkle, okay, let me know when I can continue [18:11:32] !log mw1321 php7.2 shows signs of corruption for over 2 hours – https://phabricator.wikimedia.org/T224491#5224464 [18:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:50] !log mw1348 (recent api/php72 100% experiment) shows signs of corruption [18:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:32] RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 659 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:12:42] !log Running `php7adm /opcache-free` on mw1348 and mw1321 [18:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:54] !log Running `php7adm /opcache-free` on mw1348 and mw1321, T224491 [18:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:00] T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 [18:13:04] RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 658 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:13:14] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76218 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:13:18] RECOVERY - HHVM rendering on 
mw1348 is OK: HTTP OK: HTTP/1.1 200 OK - 76218 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:13:54] Urbanecm: continue :) [18:14:00] thanks Krinkle [18:14:22] thanks Krinkle [18:15:08] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 11 ge 4 Ayounsi https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [18:16:28] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 10 ge 4 Ayounsi https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [18:18:53] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) The mw1321 issue seems to have recovered (based on querying `host:mw1321` on th... [18:24:34] !log bounce eqord-ulsfo interface to try to fix BFD sessions [18:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:56] RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:26:08] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:26:37] seems like it worked [18:26:38] !log phab1003 - switch 'vcs' user to 'NP' to match phab1001 setup and then /srv/phab/phabricator# ./bin/config set diffusion.ssh-user vcs (T224677) [18:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:43] T224677: Cannot connect to vcs@git-ssh.wikimedia.org - https://phabricator.wikimedia.org/T224677 [18:35:44] kostajh, patch got merged. 
Going to fetch to mwdebug1002, will ping you once it's there [18:39:05] kostajh, can you test your patch at mwdebug1002? [18:39:13] Urbanecm: yes, will do now [18:39:20] thanks [18:42:11] 10Operations, 10ops-codfw, 10Traffic: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (10ayounsi) a:03Papaul Assigning that to Papaul as I think next steps are for DCops, but please reassign if incorrect. [18:44:15] Urbanecm: all good, please proceed [18:44:21] kostajh, okay, going to deploy [18:47:37] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/GrowthExperiments/: [[:gerrit:513300|QuestionPoster: Correctly set timestamp when question is posted]] (T223338) (duration: 00m 51s) [18:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:42] T223338: [wmf.5] timestamp for recent questions is not updated in Homepage Help desk and Mentorship modules - https://phabricator.wikimedia.org/T223338 [18:47:48] kostajh, should be live! [18:48:58] Urbanecm: thank you! [18:49:04] you're welcome! [18:49:15] since nothing else is to be deployed... [18:49:20] !log Morning SWAT finished [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:10] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), 10Wikimedia-production-error: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) @joe I'm excluding php7.2 from all Logstash monitoring related to MediaWiki for... 
[18:53:02] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10ayounsi) [18:53:05] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10ayounsi) 05Resolved→03Open Reopening as it alerted again: https://icinga.wikimedia.org/cgi-bin/... [18:53:17] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10ayounsi) p:05Unbreak!→03High [18:53:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group in admin for jeh - https://phabricator.wikimedia.org/T224627 (10bd808) Manager approval given for production access [18:59:40] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Krinkle) [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T1900) [19:02:02] does anyone know of any recent ssh changes to either wikimedia or debian that would cause my ssh://cscott@gerrit.wikimedia.org:29418 gerrit connections to stop working? 
[19:02:29] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10ayounsi) p:05Triage→03High [19:05:25] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10ayounsi) [19:05:27] cscott: see #wikimedia-cloud you might not be the only one [19:05:36] Reedy that's gerrit not phab :) [19:05:54] naming things is hard [19:06:20] ssh paladox@gerrit.wikimedia.org -p 29418 works for me. [19:08:17] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10bd808) [19:10:05] paladox: does ifconfig show a global ipv6 address assigned? [19:11:42] gerrit's ipv6 works for me [19:11:48] telnet 2620::861:3:208:80:154:85 29418 [19:11:54] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) >>! In T198901#5220959, @Krenair wrote: > Is apertium part o... [19:12:29] 10Operations, 10Traffic: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 (10ayounsi) p:05Triage→03Normal [19:12:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Backlog), 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10greg) [19:13:17] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Krenair) Okay. Is it potentially in-scope i.e. should it appear on one... 
[19:13:36] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10ayounsi) [19:18:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [19:19:10] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [19:19:42] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) Indeed. I added it under TBD. The exact way this will be don... [19:20:32] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10herron) Discussed on IRC adding here to close the loop ` 3:16 PM XioNoX: looks like we have 813 messages queued for a single gmail user causing that alert 3:16 PM SMTP error from remote ma... [19:27:35] paladox: 208.80.154.85 on ipv4, right? (for gerrit.wikimedia.org) [19:27:54] yup, it's ip is 208.80.154.85 [19:27:58] (for ipv4) [19:28:14] debug1: Local version string SSH-2.0-OpenSSH_7.9p1 Debian-10 [19:28:14] debug1: Remote protocol version 2.0, remote software version GerritCodeReview_2.15.13-13-gd782b2dd6b (SSHD-CORE-1.6.0) [19:28:14] debug1: no match: GerritCodeReview_2.15.13-13-gd782b2dd6b (SSHD-CORE-1.6.0) [19:28:14] debug1: Authenticating to gerrit.wikimedia.org:29418 as 'cscott' [19:28:17] debug1: SSH2_MSG_KEXINIT sent [19:28:19] [...] 
[19:28:21] debug1: Server host key: ssh-rsa SHA256:j7HQoQ6fIuEgDHjONjI2CZ+2Iwxqgo2Ur5LbPqBgxOU [19:28:23] debug1: Host '[gerrit.wikimedia.org]:29418' is known and matches the RSA host key. [19:28:26] debug1: Found key in /home/cananian/.ssh/known_hosts:521 [19:28:28] debug1: rekey after 4294967296 blocks [19:28:30] debug1: SSH2_MSG_NEWKEYS sent [19:28:32] debug1: expecting SSH2_MSG_NEWKEYS [19:28:34] debug1: SSH2_MSG_NEWKEYS received [19:28:37] debug1: rekey after 4294967296 blocks [19:28:39] and then it hangs [19:28:58] i connect to the WMF-FULL VPN, so i'm "calling from inside the house" and its pretty clear I'm talking to someone... [19:28:59] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [19:29:23] (03CR) 10Eevans: "LGTM, but AFAIK this exists only for maps, so we should probably circle them in just in case." [puppet] - 10https://gerrit.wikimedia.org/r/513122 (owner: 10Alexandros Kosiaris) [19:30:17] ah, figured it out -- https://apple.stackexchange.com/questions/277479/openssh-hangs-at-rekey-after-134217728-blocks [19:30:29] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10Jdforrester-WMF) [19:30:39] hanging trying to connect to my ssh-agent. even though it already found an identity file (!) [19:31:20] 10Operations, 10Traffic, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10Jdforrester-WMF) Given this is being used now, does this count as Resolved? Or do you want to keep this open until clean-up? 
[19:38:28] (03PS1) 10Ayounsi: Add IPv6 to RPKI hosts [puppet] - 10https://gerrit.wikimedia.org/r/513364 [19:43:44] 10Operations, 10Traffic: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 (10BBlack) That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investig... [19:45:04] Hi - is the train running now? We need to deploy a fix for a UBN. [19:45:41] James_F: Do you know? ^ [19:45:52] Niharika: Go ahead. [19:46:00] James_F: Cool, thanks! [19:46:56] 10Operations, 10observability: mx1001 exim queue warning - https://phabricator.wikimedia.org/T224692 (10herron) Added some high level troubleshooting tips at https://wikitech.wikimedia.org/wiki/Exim#Troubleshooting_"exim_queue_warning"_alerts [19:49:11] James_F: Zuul seems choked. :/ [19:49:45] musikanimal: ^ We might not be able to deploy that anytime soon. [19:49:49] !log sodium (mirrors) - sudo -u mirror /usr/local/sbin/update-ubuntu-mirror [19:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:06] uh oh [19:51:30] Niharika: the SWAT pipeline has priority. [19:51:34] (03PS4) 10Bstorm: cloudstore: switch maps mounts from labstore1003 to cloudstore1008 [puppet] - 10https://gerrit.wikimedia.org/r/509470 (https://phabricator.wikimedia.org/T209527) [19:51:47] James_F: How do I get my patch in SWAT pipeline? [19:52:06] Niharika: Anything C+2'ed in a wmf.X branch will be in that automatically. [19:52:06] It's in the regular gate-and-submit queue. [19:52:43] There, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/513372 [19:52:47] Ah so I should cherry pick without waiting for it merging to master first? [19:52:53] Thanks! [19:53:13] You can cherry pick it before, after or during merging [19:53:47] Reedy: Got it. I thought there was an unsaid rule about merging to master first. 
[19:54:14] YMMV :) [19:54:17] No hard and fast [19:54:18] Even in the faster queue it shows an ETA of 21 mins. [19:54:31] jerkins lies [19:55:09] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16817/rpki1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513364 (owner: 10Ayounsi) [19:56:54] (03CR) 10Dzahn: [C: 03+1] Add IPv6 to RPKI hosts [puppet] - 10https://gerrit.wikimedia.org/r/513364 (owner: 10Ayounsi) [19:57:07] (03CR) 10Herron: [C: 03+1] Add IPv6 to RPKI hosts [puppet] - 10https://gerrit.wikimedia.org/r/513364 (owner: 10Ayounsi) [19:58:29] (03CR) 10Ayounsi: [C: 03+2] Add IPv6 to RPKI hosts [puppet] - 10https://gerrit.wikimedia.org/r/513364 (owner: 10Ayounsi) [20:07:43] musikanimal: A test failed on that patch - https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PageTriage/+/513372/ I think it's related to the patch. [20:08:49] that's a phan test that passed when the original patch was merged [20:09:00] or phan wasn't enabled yet, not sure [20:11:06] `rsync: failed to set times on "/cache/.": Operation not permitted (1)` is that it? [20:12:58] the new patch just bumps the cache version, there should be no linting errors as a result of this [20:16:15] Merging https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/513375 which should fix the ORES issues. [20:16:33] isn't there a way to `recheck` only the phan test? [20:17:58] You can just force merge if you know the failure isn't related etc [20:20:23] James did it, thank you! 
[20:23:31] (03PS1) 1020after4: Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513379 (https://phabricator.wikimedia.org/T224677) [20:25:12] (03CR) 10Dzahn: [C: 03+2] Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513379 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [20:25:17] (03CR) 10Paladox: [C: 03+1] Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513379 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [20:29:22] (03PS1) 10Ayounsi: Add IPv6 to RPKI hosts [dns] - 10https://gerrit.wikimedia.org/r/513405 [20:29:57] (03CR) 10jerkins-bot: [V: 04-1] Add IPv6 to RPKI hosts [dns] - 10https://gerrit.wikimedia.org/r/513405 (owner: 10Ayounsi) [20:30:26] (03CR) 10Dzahn: "ooops.. Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating " [puppet] - 10https://gerrit.wikimedia.org/r/513379 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [20:31:20] (03PS2) 10Ayounsi: Add IPv6 to RPKI hosts [dns] - 10https://gerrit.wikimedia.org/r/513405 [20:31:31] (03PS1) 1020after4: Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513407 (https://phabricator.wikimedia.org/T224677) [20:31:49] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513407 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [20:33:07] (03PS2) 1020after4: Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513407 (https://phabricator.wikimedia.org/T224677) [20:33:59] (03PS3) 1020after4: Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513407 (https://phabricator.wikimedia.org/T224677) [20:34:08] PROBLEM - puppet last run on 
phab1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [20:34:21] (03CR) 10Dzahn: [C: 03+2] Phabricator: write an ssh.log for phabricator's sshd [puppet] - 10https://gerrit.wikimedia.org/r/513407 (https://phabricator.wikimedia.org/T224677) (owner: 1020after4) [20:39:09] !log phabricator: restart ssh-phab.service [20:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:32] RECOVERY - puppet last run on phab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:39:59] (03PS4) 10Andrew Bogott: wmfkeystonehooks: create security groups via the neutron API [puppet] - 10https://gerrit.wikimedia.org/r/511461 [20:40:02] (03CR) 10Dzahn: [C: 03+1] Add IPv6 to RPKI hosts [dns] - 10https://gerrit.wikimedia.org/r/513405 (owner: 10Ayounsi) [20:46:48] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: create security groups via the neutron API [puppet] - 10https://gerrit.wikimedia.org/r/511461 (owner: 10Andrew Bogott) [20:50:01] (03CR) 10Ayounsi: [C: 03+2] Add IPv6 to RPKI hosts [dns] - 10https://gerrit.wikimedia.org/r/513405 (owner: 10Ayounsi) [20:52:32] musikanimal: I can deploy it now. Standby for testing? [20:53:16] yep! [20:53:20] let's do this [20:54:45] (03PS1) 10Ayounsi: Routinator, make daemon listen on v6 socket as well [puppet] - 10https://gerrit.wikimedia.org/r/513478 [20:55:19] musikanimal: Ready to test on mwdebug1002. [20:56:12] (03CR) 10Ayounsi: [C: 03+2] Routinator, make daemon listen on v6 socket as well [puppet] - 10https://gerrit.wikimedia.org/r/513478 (owner: 10Ayounsi) [20:56:28] no dice. I wonder if mwdebug would work here since this is a memcached thing? [20:57:01] Hmm. [20:57:05] So sync? [20:57:14] one sec [20:58:18] should work fine on debug, it's using a separate memc key with this patch [21:00:10] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
It might be a dependency cycle. [21:00:11] hmm okay, wonder what it could be [21:02:35] musikanimal: I can verify that it's not working. The code on mwdebug1002 has the correct patch though. [21:02:42] So that is probably not the issue? [21:03:07] I can still sync it because it doesn't seem to break anything more. Should I? [21:03:53] eh, I'd like to see it working! [21:04:43] musikanimal: I'm looking at the code and wondering now it would work. I mean, the original patch added a key to $searchableTags that will be used in iteration, during which it will search for the same key in $tags. But, where how is the 'recreate' key meant to appear in $tags / ValidTags from the database. Do those rows always exist, or is there meant to be a default of sorts somewhere? [21:05:31] yeah, I was going to ask if someone could check pagetriage_tags [21:05:36] Oh I see. I ignored the schema change because I wasn't expecting schema to involve content for said schema, that's interesting. [21:05:38] you should see 'recreated' in there [21:06:07] does the schema change run automatically? [21:06:09] rolling out the patch should be fine though, we can keep looking at it :) [21:06:17] No, no automatic schema changes happen in production. [21:06:26] okay so that's the problem [21:06:27] was this scheduled/requested with DBAs? [21:06:53] I don't think we need DBA here, it's just adding a single row. Not really a "schema" change, per se [21:07:04] We do not allow code to depend on schema changes, everything must be back-compat for 1 schema change. Because we can't stop time :) The exception being if the schema can be forward-compat and applied first, but that's too late now. [21:07:06] Yeah, fair enough. [21:07:12] this isn't the first time we've done this, but the first time we've done it wrong! 
[21:07:21] !log add RPKI sessions on cr4-ulsfo - T220669 [21:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:27] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [21:07:28] https://quarry.wmflabs.org/query/36526 [21:07:37] doesn't look like it's there [21:07:45] I'll run it on mwmaint [21:07:47] but. [21:07:48] right, that's the replicas [21:07:51] the cache key is now ruiined [21:07:59] oh yeah dammit [21:08:08] okay, we'll just have to do it all over again [21:08:13] Are there other patches waiting? [21:08:20] for SWAT? dunno [21:08:27] I suggest rolling this out, and we can finish this separately. I'll add the row to the db meanwhile [21:08:59] awesome, thank you! [21:09:09] and then we need another patch to bump the cache version yet again, I assume [21:10:18] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/PageTriage: Bump wgPageTriageCacheVersion T224693 (duration: 00m 51s) [21:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:23] T224693: Error from ApiPageTriageList.php: "Undefined index: recreated" - https://phabricator.wikimedia.org/T224693 [21:10:38] I rolled that patch out. [21:10:45] Thanks Krinkle. [21:11:08] !log krinkle@mwmaint1002 Add 1 row to pagetriage_tags table on enwiki, based on PageTriageTagsPatch-recreated.sql. T224693, T189929 [21:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:15] T189929: Add "previously deleted" as a possible issue in the New Pages Feed - https://phabricator.wikimedia.org/T189929 [21:11:28] musikanimal: any other wikis? [21:12:05] Krinkle: Done for testwiki yet? [21:12:17] yes just testwiki and enwiki [21:12:30] And test2wiki. [21:12:40] !log krinkle@mwmaint1002 Add 1 row to pagetriage_tags table on testwiki db, based on PageTriageTagsPatch-recreated.sql. 
T224693, T189929 [21:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:21] test2wiki appears many more versions behind [21:13:44] I wouldn't worry about that one [21:13:44] ugh, [21:13:47] wmgEnablePageTriage is true, though. [21:13:49] * James_F shrugs. [21:13:51] enwiki has 21 rows now (with patch applied) [21:14:00] testwiki has 20 rows now (with patch applied) [21:14:05] test2wiki has 17 rows (patch not applied) [21:14:11] so something is still out of sync [21:14:38] Maybe we should undeploy it. [21:14:50] we haven't done any testing over there, that I am sure. Would not oppose undeployment [21:15:17] musikanimal: ignoring test2wiki, what was previous change for PageTriage, maybe that was forgotten on testwiki [21:15:35] I do need another patch to bump the cache version, right? I don't understand why this cheap lookup needs to be cached anyway, but I digress [21:16:00] musikanimal: yeah one more patch [21:16:06] Looks like 'afc_state' is missing on testwiki, but present on enwiki [21:16:10] does that sound familiar? 
[21:18:12] (03CR) 10jenkins-bot: MWScript.php: Mark refreshMessageBlobs.php as a global script [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513174 (https://phabricator.wikimedia.org/T222539) (owner: 10Catrope) [21:18:15] (03CR) 10jenkins-bot: build: on CI only lint changed files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/491564 (owner: 10Hashar) [21:18:20] (03CR) 10jenkins-bot: Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [21:18:32] (03CR) 10jenkins-bot: mariadb: Depool db1089 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513152 (owner: 10Jcrespo) [21:18:46] James_F: lol, that postmerge only took 22 hours to complete xD [21:18:50] https://gerrit.wikimedia.org/r/c/513495/ [21:18:56] (03CR) 10jenkins-bot: Enable abusefilter blocking ability in plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513186 (https://phabricator.wikimedia.org/T224617) (owner: 10Urbanecm) [21:19:11] yeah AfC is an enwiki-only thing, so not surprised if there's some parts of it missing on testwiki [21:19:18] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513278 (https://phabricator.wikimedia.org/T221502) (owner: 10Marostegui) [21:19:23] I believe Growth used the beta cluster to test that bit [21:19:26] Krinkle: `INSERT INTO /*_*/pagetriage_tags (ptrt_tag_name, ptrt_tag_desc) VALUES ('user_experience', 'Experience level: newcomer, learner, experienced or anonymous' ); [21:19:26] ` is missing on testwiki but in the .sql file. [21:19:37] musikanimal: OK. So which keys it tries to access and expect varies by site config? [21:19:51] Or should this be applied to testwiki as well? [21:20:04] Krinkle: afc_state is on testwiki? [21:20:05] also, given these appear immutable, looks like this can just be an array in PHP but oh well. 
[21:20:11] James_F: no, enwiki only currently [21:20:15] that's the missing row (count 21 vs 20) [21:20:24] Krinkle: Not for me. [21:20:27] well, suspected missing row I don't know if it matters [21:20:31] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513284 (owner: 10Zfilipin) [21:20:58] Krinkle: `mwscript sql.php --wiki=testwiki` `SELECT * FROM pagetriage_tags` gives me 20 results; missing `user_experience`. [21:21:17] ah, the order is different [21:21:18] indeed [21:21:39] And yes, we should probably fix testwiki to inject that one too. [21:21:54] But the fact that no-one noticed suggests we should switch it off on testwiki as well. [21:21:54] (03CR) 10jenkins-bot: mariadb: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513281 (https://phabricator.wikimedia.org/T213422) (owner: 10Jcrespo) [21:22:44] James_F: not just no-one noticed. it appears the php error doesn't happen there [21:22:50] which means maybe the row is not needed. [21:22:50] Hmmph. [21:22:56] anyway, will leave unchanged for now [21:22:57] Or just never triggered? [21:23:00] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513247 (owner: 10Marostegui) [21:23:01] yeah, maybe [21:23:09] we do use testwiki, or have used it [21:23:38] James_F: We plan to use it on NPP testwiki. [21:23:38] and we *should* have used it to test the "recreated" stuff [21:24:04] draining extensions from testwikis seems like a good plan overall, shifting more towards beta cluster instead for actual testing before prod. There will still be cases (esp around cross-wiki, central auth, Swift, Varnish etc. 
stuff) where testwiki is useful, but for the most part I imagine it's a bit late to test something there [21:24:19] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1089 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513263 (owner: 10Jcrespo) [21:24:20] musikanimal: ah, good point. yeah, you'd have noticed the missing schema change. [21:24:25] Well, long-term we want to delete production test wikis. [21:24:34] Beta cluster has way too many issues to be something we can offer to users to test on. [21:24:37] Krinkle: Do you want to fix testwiki now, given you've done that already. [21:24:38] yeah, that's what I mean by aiming for reducing use of it where we don't need it. [21:24:44] It's great for internal QA but not users. [21:24:47] Niharika: You shouldn't point users at testwiki either. [21:24:53] I'll have to point out that we didn't bump the PageTriage cache version on the beta cluster but the new code still worked... [21:24:57] James_F: Why not? [21:25:00] that seems like a beta feature use case [21:25:07] Niharika: It's a catastrophe of breakingness. [21:25:10] if it's in prod already, might as well go on the prod wikis. [21:25:22] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2087 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513149 (owner: 10Jcrespo) [21:25:37] musikanimal: Beta Cluster runs update.php and drops cache all the time. It's not comparable. [21:25:42] (03CR) 10jenkins-bot: mariadb: Repool db1089 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513264 (owner: 10Jcrespo) [21:25:45] James_F: NPP used to work fairly okay on testwiki last time I tested on it. [21:25:57] (03CR) 10jenkins-bot: profiler: Remove fake "HTTP-verb" stack frame from HHVM sampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513320 (owner: 10Krinkle) [21:25:57] But yeah, I agree with general breakingness. [21:26:12] yeah that's my point, we wouldn't have noticed this issue. 
But obviously we didn't take advantage of testwiki for testing it either =p [21:26:18] It's way better than beta cluster though. And doesn't require new login. [21:27:14] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:27:26] !log krinkle@mwmaint1002 Add 1 row to pagetriage_tags table on test2wiki db, based on PageTriageTagsPatch-recreated.sql. T224693, T189929 [21:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:33] T224693: Error from ApiPageTriageList.php: "Undefined index: recreated" - https://phabricator.wikimedia.org/T224693 [21:27:33] T189929: Add "previously deleted" as a possible issue in the New Pages Feed - https://phabricator.wikimedia.org/T189929 [21:27:53] Krinkle: Wait, did you do that to testwiki or test2wiki? [21:28:08] I've fixed todays issue on all three wikis the table exists on [21:28:13] Right. [21:28:14] last one was test2wiki [21:28:52] I'd prefer not to apply the older schema change now, unless we know it causes an issue. Just because it's from longer ago and don't want to touch it :) [21:31:56] Psh, weakness. :-) [21:40:29] Niharika: was this btw a late swat or other deploy? [21:40:41] looking to roll out two more patches just making sure I'm not clashing [21:40:57] Krinkle: It's a regular deploy. :) Not a swat. [21:41:00] k [21:41:34] Krinkle: Happy to let you deploy it if you were planning to. [21:41:35] (03PS1) 10Florianschmidtwelzow: [phabricator] Remove extra comma from footer [puppet] - 10https://gerrit.wikimedia.org/r/513501 [21:41:49] Niharika: ah well, if you're okay with doing more (based on the concurrent +2), go ahead :) [21:42:00] (03PS2) 10Florianschmidtwelzow: [phabricator] Remove extra comma from footer [puppet] - 10https://gerrit.wikimedia.org/r/513501 [21:42:07] (03CR) 10Florianschmidtwelzow: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/513501 (owner: 10Florianschmidtwelzow) [21:42:16] I've merged https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/513411/ at the same time, so beware during the pull. [21:43:04] Krinkle: I can deploy that one too. Anything else to deploy? [21:43:19] nope, that's it for today. thanks :) [21:49:02] (03CR) 10Krinkle: [phabricator] Remove extra comma from footer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/513501 (owner: 10Florianschmidtwelzow) [21:53:40] okay same Phan issue at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/513495 [21:54:14] something to do with rsync. Not our patch, at least [22:01:00] Looks like someone may've upgraded phan or changed its configuration without it being back-compat with master branches [22:01:06] in the last hour? [22:01:26] yeah, I could only find an error about rsync [22:01:46] who can V+2? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/513495 [22:02:07] ah, the rsync is a red herring [22:02:19] 00:02:47.279 [22:02:19] 00:02:47.279 [22:02:19] 00:02:47.280 [22:02:19] 00:02:47.280 [22:02:19] 00:02:47.281 [22:02:24] go to expand the sections [22:02:46] seems genuine, just weird that it didn't complain before. [22:02:51] see I thought I had done that [22:02:54] Did someone merge something getting rid of the Block class? [22:03:25] I thought it was going to be deprecated for a long while. [22:03:31] anywa, doesn't affect wmf branch [22:03:33] only master [22:04:08] and it's not from the last hour because James_F overrode https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/PageTriage/+/513355/ with +2 as well [22:04:10] so that explains that :) [22:04:16] * James_F grins. [22:04:26] yeah, Block alias was added in 2-3 days in core, seems phan doesn't understand it. [22:04:46] Ooooh, interesting. [22:04:49] Dear Phan… [22:05:04] and not in the gate, so only found now. 
[22:05:16] How did we fix it for other class aliases? [22:05:37] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16818/phab2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513327 (owner: 10Paladox) [22:06:06] (03PS3) 10Dzahn: Phabricator: Fix aphlict to not try and start service if ensure == absent [puppet] - 10https://gerrit.wikimedia.org/r/513327 (owner: 10Paladox) [22:06:55] James_F: well, that's a long story, but from a quick glance at https://github.com/wikimedia/mediawiki/commit/e65a5b5882 it seems like it's using the same method for aliasing as what we did for IDatabase etc. (in short: same file as destination class, and coded in the autoloader directly). [22:08:08] Hmm. [22:08:14] But it doesn't work? [22:08:27] I vaguely remember seeing other phan issues with class aliases. [22:08:42] Phan seems fine with all the use of rdbms global aliases we had [22:08:54] pretty sure there's still some code using that and that's not being complained about [22:09:08] then then again, maybe that's a very common ignore or excemption, or maybe it's due to an upgrade in phan [22:11:11] sorry to sidetrack, can we V+2 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/513495 ? I'll fix the Phan errors in a separate patch [22:11:36] the wmf commit isn't failing because that patch only landed in core recently [22:11:49] so that could be rebased later on said patch. [22:12:13] (reason being, it could mask other errors and rather not start a practice of bypassing V generally) [22:12:33] gotcha, so there's no rush to merge the master patch [22:14:18] well, needs to land before next Tuesday [22:14:40] sure [22:14:41] but yeah, no rush :) [22:14:53] filed https://phabricator.wikimedia.org/T224704 [22:14:56] Niharika: are we good to SWAT https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageTriage/+/513497 ? [22:15:19] musikanimal: Oh it got merged, yay! I can deploy it now. 
[22:15:49] (03PS1) 10Dzahn: Revert "Phabricator: Fix aphlict to not try and start service if ensure == absent" [puppet] - 10https://gerrit.wikimedia.org/r/513522 [22:16:40] musikanimal: It's on mwdebug1002. [22:16:49] Krinkle: Your change is there too. [22:17:13] Niharika: works! \o/ [22:17:14] It works! :D [22:17:17] (03CR) 10Dzahn: [C: 03+2] Revert "Phabricator: Fix aphlict to not try and start service if ensure == absent" [puppet] - 10https://gerrit.wikimedia.org/r/513522 (owner: 10Dzahn) [22:17:20] Yayayay. [22:18:16] Niharika: confirmed for me as well [22:19:09] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/PageTriage/: Fix broken feed - T224693 (duration: 00m 51s) [22:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:18] T224693: Error from ApiPageTriageList.php: "Undefined index: recreated" - https://phabricator.wikimedia.org/T224693 [22:19:19] musikanimal: Fix in prod^ [22:19:37] phew! thank you Niharika :) [22:20:48] !log niharika29@deploy1001 Synchronized php-1.34.0-wmf.7/extensions/WikimediaEvents/: Avoid division by zero warnings T224686 (duration: 00m 49s) [22:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:55] T224686: PHP error "Division by zero" from WikimediaEventsHooks.php - https://phabricator.wikimedia.org/T224686 [22:21:18] Thanks for working on it, musikanimal! :) [22:21:24] Krinkle: Deployed! [22:24:20] thanks! 
[22:24:32] !log phab2001 - scap pull - but it fails with directory /srv/mediawiki not found that's so wrong [22:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:04] !log phab2001 / phab1003 - why is 'git status' in /srv/phab/phabricator unclean with lots of file deletions but also not identical [22:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:16] (03CR) 10Dzahn: "my comments remain unchanged" [puppet] - 10https://gerrit.wikimedia.org/r/484308 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [22:38:48] (03PS2) 10Dzahn: nagios_common: update members of the gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/512292 [22:40:34] (03CR) 10Paladox: [C: 03+1] nagios_common: update members of the gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/512292 (owner: 10Dzahn) [22:42:22] (03CR) 10Dzahn: "seems like most people are filtering their gerrit mail? :(" [puppet] - 10https://gerrit.wikimedia.org/r/506542 (https://phabricator.wikimedia.org/T220860) (owner: 10Dzahn) [22:50:17] (03CR) 10Dzahn: "this looks empty. still a draft?" [puppet] - 10https://gerrit.wikimedia.org/r/510626 (owner: 10Paladox) [22:50:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10dduvall) a:05hashar→03dduvall >>! In T219850#5224558, @ayounsi wrote: > Reopening as it alerted... [22:52:10] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [22:53:41] !log deleting stale docker images from contint1001, cc: T207707 T219850 [22:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:48] T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 [22:53:48] T207707: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 [22:59:31] !log add terms to drop specific icmp frag packets from cr1/2-eqiad - T224186 [22:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:45] !log deleted 95 docker images from contint1001, freeing ~ 8G on / cc: T219850 [22:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:50] T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 [23:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190530T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:03:20] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10dduvall) [23:03:23] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10dduvall) 05Open→03Resolved Alert is back to OK. 
[23:25:48] 10Operations: Debian mirror in sync with upstream - https://phabricator.wikimedia.org/T224706 (10ayounsi) p:05Triage→03High [23:26:23] ACKNOWLEDGEMENT - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. Ayounsi https://phabricator.wikimedia.org/T224706 https://wikitech.wikimedia.org/wiki/Mirrors [23:36:21] !log remove BGP sessions to starhub on cr4-ulsfo (left the IXP) [23:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:17] 10Operations: Debian mirror in sync with upstream - https://phabricator.wikimedia.org/T224706 (10Dzahn) The Debian mirror sync uses ftpsync, unlike the Ubuntu mirror sync. So you won't find a puppetized cron and rsync like you do for Ubuntu. The relevant config is /etc/ftpsync/ftpsync.conf The logs are /var/lo...