[00:16:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.100 second response time
[00:21:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1955 bytes in 0.087 second response time
[00:26:19] Dereckson: hm?
[00:31:20] AaronSchulz: was considering to deploy https://gerrit.wikimedia.org/r/#/c/414310/ to get the db load balancer factory from the container
[00:31:24] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 316 MB (3% inode=75%)
[00:31:34] as the evening SWAT window was quiet
[00:36:31] Operations, Deployments, Release, Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (mmodell)
[00:37:19] Operations, Deployments, Release, Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128799 (awight)
[00:50:17] Dereckson: seems fine
[00:54:20] Operations, netops: asw1-eqsin vcp port flapping - https://phabricator.wikimedia.org/T192125#4128813 (ayounsi) p:Triage>High
[01:20:45] Operations, netops, Patch-For-Review: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4128840 (ayounsi) From JTAC, the nonstop-routing issue most likely have been caused by a Junos bug where the following commit sometimes enables nonstop-routing before disabling graceful-restart, while t...
[01:30:24] PROBLEM - ensure kvm processes are running on labvirt1015 is CRITICAL: PROCS CRITICAL: 96 processes with regex args /usr/bin/kvm
[01:31:25] RECOVERY - ensure kvm processes are running on labvirt1015 is OK: PROCS OK: 94 processes with regex args /usr/bin/kvm
[01:43:46] (PS2) Krinkle: beta: Combine commons, deployments, meta and zero vhost (2) [puppet] - https://gerrit.wikimedia.org/r/425858 (owner: Jcrespo)
[02:09:40] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4128865 (Dzahn) Adding a custom contact group to the "LVS HTTP IPv4" service doesn't look trivial to me. First we have the de...
[02:24:29] (PS1) Dzahn: installserver: convert nested roles to profiles [puppet] - https://gerrit.wikimedia.org/r/425945
[02:33:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1948 bytes in 0.100 second response time
[02:44:08] (PS1) Dzahn: aptrepo::wikimedia: convert from role to profile [puppet] - https://gerrit.wikimedia.org/r/425946
[03:07:39] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4128886 (bearND) I think the services themselves should be sufficient for us. We probably don't need to hosts themselves if it...
[04:13:43] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1946 bytes in 0.100 second response time
[05:00:44] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.126 second response time
[05:04:16] When I tried to sql wikishared on terbium, "Error looking up DB "wikishared"" -- what can be wrong here?
[05:05:44] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1964 bytes in 0.104 second response time
[05:09:25] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425957
[05:09:29] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425957
[05:11:45] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425957 (owner: Marostegui)
[05:12:59] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425957 (owner: Marostegui)
[05:13:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425957 (owner: Marostegui)
[05:14:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 after alter table (duration: 01m 01s)
[05:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[05:18:13] (PS1) Marostegui: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/425958
[05:19:08] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425959
[05:19:11] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425959
[05:22:13] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425959 (owner: Marostegui)
[05:23:24] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425959 (owner: Marostegui)
[05:23:39] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - https://gerrit.wikimedia.org/r/425959 (owner: Marostegui)
[05:24:03] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[05:25:03] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/425958 (owner: Marostegui)
[05:26:16] (Merged) jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/425958 (owner: Marostegui)
[05:27:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 after alter table (duration: 01m 01s)
[05:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:25] (CR) jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/425958 (owner: Marostegui)
[05:31:06] (PS1) Marostegui: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/425960 (https://phabricator.wikimedia.org/T187089)
[05:32:44] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1975 bytes in 0.082 second response time
[05:32:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/425960 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:34:05] (Merged) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/425960 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:34:56] (CR) jenkins-bot: db-eqiad.php: Depool db1104 [mediawiki-config] - https://gerrit.wikimedia.org/r/425960 (https://phabricator.wikimedia.org/T187089) (owner: Marostegui)
[05:35:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 for alter table (duration: 01m 00s)
[05:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:04] !log Deploy schema change on db1104 - T187089 T185128 T153182
[05:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:11] T187089: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089
[05:37:11] T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[05:37:11] T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[05:37:53] marostegui: sql command on terbium doesn't work for me. Do you know where to get more details? It works for other people I checked.
[05:37:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.152 second response time
[05:39:02] kart_: what error do you get?
[05:39:09] When I tried to sql wikishared on terbium, "Error looking up DB "wikishared"" -- what can be wrong here?
[05:39:15] marostegui: ^^
[05:39:42] That actually works for me
[05:39:49] :~
[05:48:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 from main traffic - T191996 (duration: 01m 00s)
[05:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:51] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[05:49:12] (CR) jenkins-bot: db-eqiad.php: Depool db1114 from main [mediawiki-config] - https://gerrit.wikimedia.org/r/425961 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[05:49:24] marostegui: no. I'm using new ssh key only to connect terbium. That's again strange.
[05:49:40] can you try this?
[05:49:42] (give me a sec)
[05:50:11] ah no, nevermind, you need a pass for that:)
[05:52:31] what if you try: mysql -u wikiadmin -p -h 10.64.32.26 --port 3306 -D wikishared
[05:52:36] does that ask for a passowrd?
[05:52:38] password
[05:59:00] checking. Sorry, was bit distracted.
[05:59:36] marostegui: yes
[05:59:42] asking password.
[06:02:17] and what do you get from: sql wikishared -e "select now()" -BN
[06:03:06] marostegui: checking
[06:04:02] marostegui: Error looking up DB "wikishared"
[06:04:07] same error.
[06:04:18] can you do: screen -S
[06:04:23] I will attach to that screen and see what you get
[06:04:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.099 second response time
[06:14:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.094 second response time
[06:28:33] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:33:06] (PS1) ArielGlenn: no stat1005 /sv/dumps rsyncs to dumps servers until there's data [puppet] - https://gerrit.wikimedia.org/r/425962 (https://phabricator.wikimedia.org/T189283)
[06:35:47] (CR) ArielGlenn: [C: 2] no stat1005 /sv/dumps rsyncs to dumps servers until there's data [puppet] - https://gerrit.wikimedia.org/r/425962 (https://phabricator.wikimedia.org/T189283) (owner: ArielGlenn)
[06:51:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1954 bytes in 0.082 second response time
[06:56:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1945 bytes in 0.098 second response time
[07:01:52] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4129085 (EddieGP) >>! In T188913#4127994, @thcipriani wrote: > Well the deployment-mediawiki-07 backend was the cause of 503s today. I changed the appserver backend in hier...
[07:06:27] Operations, Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4023631 (MoritzMuehlenhoff) @EddieGP : I'm not sure what changed with the addition of mediawiki07, but I can confirm that mediawiki04 was definitely serving traffic as of T...
[07:08:30] If I want to remove a cron from the fleet, I'd have to make the first patch 'ensure => absent' it and then a second one to remove the code (to be merged after puppet definitely was run on all appservers), right?
[07:08:42] Or is there some special handling for such cleanup stuff?
[07:12:11] eddiegp: yeah, those two steps are needed, we don't have a fancier solution here
[07:12:31] Okay, I'll do that then.
[07:15:53] !log pooling mw1265 and mw1279 for production traffic
[07:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:20] !log restarting jenkins
[07:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.095 second response time
[07:28:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.107 second response time
[07:30:55] (PS4) Elukey: role::analytics_cluster::coordinator: add the eventlogging whitelist [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691)
[07:33:20] (PS1) EddieGP: mediawiki: Disable updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425967 (https://phabricator.wikimedia.org/T192139)
[07:33:22] (PS1) EddieGP: mediawiki: Remove updateArticleCount cron [puppet] - https://gerrit.wikimedia.org/r/425968 (https://phabricator.wikimedia.org/T192139)
[07:35:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.214 second response time
[07:42:29] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10919/" [puppet] - https://gerrit.wikimedia.org/r/425810 (https://phabricator.wikimedia.org/T189691) (owner: Elukey)
[07:44:33] (CR) Ema: [C: 1] Remove Varnish config for image scaler cluster [puppet] - https://gerrit.wikimedia.org/r/424552 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff)
[07:47:22] (CR) Ema: [C: 1] Remove LVS/pybal config for image scaler cluster [puppet] - https://gerrit.wikimedia.org/r/424553 (https://phabricator.wikimedia.org/T188062) (owner: Muehlenhoff)
[07:48:21] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request for Raz be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#4129157 (Addshore)
[07:48:34] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request for Tonina to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#4129158 (Addshore)
[07:49:29] Operations, LDAP-Access-Requests, WMF-NDA-Requests, User-Addshore: Request for Pablo to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T177599#4129159 (Addshore)
[07:50:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1959 bytes in 0.104 second response time
[07:51:14] Operations, ops-codfw, DC-Ops, Traffic: lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082#4129162 (ema) p:Triage>Normal
[07:51:32] (PS1) Giuseppe Lavagetto: role::beta::mediawiki: remove inclusion of base::firewall [puppet] - https://gerrit.wikimedia.org/r/425969
[07:51:38] Operations, Pybal, Traffic, netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4129163 (ema)
[07:52:18] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4129167 (Addshore)
[07:52:31] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4129169 (Addshore)
[07:52:50] !log mobrovac@tin Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging - T174916
[07:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:57] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[07:54:13] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Tarrow to the ldap/wmde group - https://phabricator.wikimedia.org/T192060#4126243 (Addshore) I have removed the line about L2 from here and T191523. We already realised that we don't need it in Feb, but apparently the template was not...
[07:54:29] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Request to add Matthias Geisler to the ldap/wmde group - https://phabricator.wikimedia.org/T191523#4108485 (Addshore) I have removed the line about L2 from here and T192060. We already realised that we don't need it in Feb, but apparently the templa...
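(Editorial note on the 07:08 to 07:12 exchange about removing a cron from the fleet: the two steps agreed on there might look roughly like the Puppet fragment below. This is only a sketch under assumptions. The resource title `example_cleanup_job` and the user `www-data` are hypothetical placeholders, and nothing here is taken from the actual patches linked in the log.)

```puppet
# Step 1 (first patch): keep the resource declared but mark it absent,
# so that each Puppet agent run removes the entry from the crontab.
# Title and user are hypothetical placeholders, not the real resource.
cron { 'example_cleanup_job':
  ensure => absent,
  user   => 'www-data',
}

# Step 2 (second patch, merged only once Puppet has run on every
# affected host): delete the resource block above from the manifest.
```

(The ordering matters because Puppet only manages what is declared: if the resource were simply deleted in one patch, hosts that already have the crontab entry would keep it.)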
[07:55:53] PROBLEM - pdfrender on scb2006 is CRITICAL: connect to address 10.192.32.20 and port 5252: Connection refused
[07:56:53] RECOVERY - pdfrender on scb2006 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.100 second response time
[07:58:04] !log mobrovac@tin Started restart [electron-render/deploy@94d27d7]: Kick Electron, hanging, take 2 - T174916
[07:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:11] T174916: electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916
[07:58:19] (CR) Muehlenhoff: [C: 1] "Looks good, profile::base::firewall is included in all mediawiki roles already." [puppet] - https://gerrit.wikimedia.org/r/425969 (owner: Giuseppe Lavagetto)
[08:01:33] (CR) Jcrespo: [C: 2] Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746 (owner: Rduran)
[08:01:35] (CR) Jcrespo: [V: 2 C: 2] Create tests skeleton [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/420746 (owner: Rduran)
[08:02:11] (CR) Jcrespo: [V: 2 C: 2] Refactor and test the main OSC run method [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/421340 (owner: Rduran)
[08:03:08] (CR) Jcrespo: [V: 2 C: 2] Make WMFMariaDB.py flake8 compliant [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/424558 (owner: Rduran)
[08:03:44] (CR) Giuseppe Lavagetto: [C: 2] role::beta::mediawiki: remove inclusion of base::firewall [puppet] - https://gerrit.wikimedia.org/r/425969 (owner: Giuseppe Lavagetto)
[08:04:23] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[08:04:23] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:04:23] PROBLEM - nutcracker process on deploy1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (nutcracker), command name nutcracker
[08:04:34] PROBLEM - nutcracker port on deploy1001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused
[08:05:08] (CR) Jcrespo: [V: 2 C: 2] Add tests for the argument parsing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/424560 (owner: Rduran)
[08:19:10] (PS4) Rduran: Add tests for the argument parsing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/424560
[08:20:30] (PS1) Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/425970 (https://phabricator.wikimedia.org/T191996)
[08:21:22] (PS1) Vgutierrez: install_server: Reimage nescio.wikimedia.org as stretch [puppet] - https://gerrit.wikimedia.org/r/425971
[08:21:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/425970 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[08:21:56] (CR) Jcrespo: [V: 2 C: 2] Add tests for the argument parsing [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/424560 (owner: Rduran)
[08:23:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/425970 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[08:23:44] (CR) jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/425970 (https://phabricator.wikimedia.org/T191996) (owner: Marostegui)
[08:24:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully depool db1114 - T191996 (duration: 01m 00s)
[08:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:56] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[08:33:13] (CR) Alexandros Kosiaris: [C: -1] "LGTM, minor comment inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/425945 (owner: Dzahn)
[08:33:45] (CR) Alexandros Kosiaris: [C: 1] aptrepo::wikimedia: convert from role to profile [puppet] - https://gerrit.wikimedia.org/r/425946 (owner: Dzahn)
[08:37:59] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/425973
[08:39:40] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/425973 (owner: Marostegui)
[08:40:54] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/425973 (owner: Marostegui)
[08:41:08] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/425973 (owner: Marostegui)
[08:41:36] (CR) Vgutierrez: [C: 2] install_server: Reimage nescio.wikimedia.org as stretch [puppet] - https://gerrit.wikimedia.org/r/425971 (owner: Vgutierrez)
[08:41:44] (PS2) Vgutierrez: install_server: Reimage nescio.wikimedia.org as stretch [puppet] - https://gerrit.wikimedia.org/r/425971
[08:42:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1114 in API - T191996 (duration: 01m 00s)
[08:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:32] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[08:46:24] Operations, Performance-Team, Traffic, Patch-For-Review, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (Gilles) Do you want to rephrase this task's description to be about the incident? As a "let's investigate what happened"...
[08:49:59] (PS3) Rduran: Add integration tests to test agains MariaDB [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/425291
[08:50:31] (PS1) Giuseppe Lavagetto: deployment-prep: decommission deployment-mediawiki{04,05}, add deployment-mediawiki-09 [puppet] - https://gerrit.wikimedia.org/r/425975
[08:51:01] (CR) jerkins-bot: [V: -1] deployment-prep: decommission deployment-mediawiki{04,05}, add deployment-mediawiki-09 [puppet] - https://gerrit.wikimedia.org/r/425975 (owner: Giuseppe Lavagetto)
[08:51:36] (PS1) Muehlenhoff: Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991)
[08:52:11] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:52:30] !log depool and reimage nescio.wikimedia.org as stretch
[08:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:44] (PS2) Muehlenhoff: Enable base::service_auto_restart for prometheus-blazegraph-exporter [puppet] - https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991)
[08:52:53] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1977 bytes in 0.101 second response time
[08:53:13] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=nescio.wikimedia.org,service=pdns_recursor
[08:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.094 second response time
[09:00:44] PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: PING CRITICAL - Packet loss = 100%
[09:01:42] ^ 2620:0:862:1:91:198:174:106 is nescio which is reimaged
[09:02:54] PROBLEM - Recursive DNS on 91.198.174.106 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[09:03:18] !log reimaging mw1276-mw1278 to stretch (T174431)
[09:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:28] T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431
[09:03:43] PROBLEM - Host 2620:0:862:1:91:198:174:106 is DOWN: PING CRITICAL - Packet loss = 100%
[09:23:25] (CR) Elukey: "Very good first step, I added some comments to tidy up a bit how things are presented to be a bit more maintainable in the future (in my o" (15 comments) [puppet] - https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: Fdans)
[09:24:27] (PS2) Giuseppe Lavagetto: deployment-prep: update scap dsh lists [puppet] - https://gerrit.wikimedia.org/r/425975 (https://phabricator.wikimedia.org/T192071)
[09:25:08] (PS1) Muehlenhoff: Remove conftool configuration for image scalers [puppet] - https://gerrit.wikimedia.org/r/425982
[09:27:09] (PS2) Muehlenhoff: Remove conftool configuration for image scalers [puppet] - https://gerrit.wikimedia.org/r/425982
[09:27:22] (PS3) Giuseppe Lavagetto: deployment-prep: update scap dsh lists [puppet] - https://gerrit.wikimedia.org/r/425975 (https://phabricator.wikimedia.org/T192071)
[09:28:20] (CR) Giuseppe Lavagetto: [C: 2] deployment-prep: update scap dsh lists [puppet] - https://gerrit.wikimedia.org/r/425975 (https://phabricator.wikimedia.org/T192071) (owner: Giuseppe Lavagetto)
[09:29:12] Operations, Traffic: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129273 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['nescio.wikimedia.org'] ``` Of which those **FAILED**: ``` ['nescio.wikimedia.org'] ```
[09:33:21] !log start reimage of es1013
[09:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:30] (PS4) Rduran: Add integration tests to test agains MariaDB [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/425291
[09:39:31] (CR) Jcrespo: [V: 2 C: 2] Add integration tests to test agains MariaDB [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/425291 (owner: Rduran)
[09:46:07] (PS2) Gilles: navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[09:46:27] (CR) Gilles: [C: 1] navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[09:48:13] (PS1) Lokal Profil: Allow prefix to override "all" [dumps/dcat] - https://gerrit.wikimedia.org/r/425987
[09:49:13] (PS1) Jcrespo: mariadb-autoinstall: Return reimage configuration state back to normal [puppet] - https://gerrit.wikimedia.org/r/425988
[09:54:26] (PS3) Lokal Profil: Support prefixed dump types [puppet] - https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328)
[09:55:10] PROBLEM - Host 2620:0:862:1:a6ba:dbff:fe30:d0df is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:51] (CR) Lokal Profil: "Realised the truthy dump is named diferently from the others." [puppet] - https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: Lokal Profil)
[09:57:13] (PS1) Vgutierrez: puppetboard: Show times in UTC [puppet] - https://gerrit.wikimedia.org/r/425990
[09:57:20] PROBLEM - Disk space on mw1276 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:57:20] PROBLEM - dhclient process on mw1276 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:57:20] PROBLEM - Disk space on mw1277 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:57:20] PROBLEM - dhclient process on mw1277 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:59:06] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/425990 (owner: Vgutierrez)
[09:59:11] PROBLEM - mediawiki-installation DSH group on mw1276 is CRITICAL: Host mw1276 is not in mediawiki-installation dsh group
[09:59:11] PROBLEM - HHVM processes on mw1276 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:59:11] PROBLEM - HHVM processes on mw1277 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[09:59:11] PROBLEM - mediawiki-installation DSH group on mw1277 is CRITICAL: Host mw1277 is not in mediawiki-installation dsh group
[09:59:27] (CR) Vgutierrez: [C: 2] puppetboard: Show times in UTC [puppet] - https://gerrit.wikimedia.org/r/425990 (owner: Vgutierrez)
[09:59:34] ^reimages, silencing again
[09:59:36] (PS2) Vgutierrez: puppetboard: Show times in UTC [puppet] - https://gerrit.wikimedia.org/r/425990
[10:00:31] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:00:31] PROBLEM - High CPU load on API appserver on mw1277 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:00:42] PROBLEM - HHVM rendering on mw1276 is CRITICAL: connect to address 10.64.0.71 and port 80: Connection refused
[10:00:42] PROBLEM - HHVM rendering on mw1277 is CRITICAL: connect to address 10.64.0.72 and port 80: Connection refused
[10:00:42] PROBLEM - nutcracker port on mw1276 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:00:42] PROBLEM - nutcracker port on mw1277 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[10:01:49] !log installing java security updates on meiterium/archive.wikimedia.org
[10:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:21] PROBLEM - Host 2620:0:862:1:a6ba:dbff:fe30:d0df is DOWN: CRITICAL - Destination Unreachable (2620:0:862:1:a6ba:dbff:fe30:d0df)
[10:05:02] (PS2) Jcrespo: mariadb-autoinstall: Return reimage configuration state back to normal [puppet] - https://gerrit.wikimedia.org/r/425988
[10:06:48] (PS1) Alexandros Kosiaris: Add mobileapps to contacts for mobileapps LVS service [puppet] - https://gerrit.wikimedia.org/r/425991 (https://phabricator.wikimedia.org/T189524)
[10:11:05] (PS1) Jcrespo: mariadb: Repool es1013 with low weight after reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/425992
[10:12:17] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1947 bytes in 0.114 second response time
[10:13:01] (CR) Jcrespo: [C: 2] mariadb-autoinstall: Return reimage configuration state back to normal [puppet] - https://gerrit.wikimedia.org/r/425988 (owner: Jcrespo)
[10:16:16] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=nescio.wikimedia.org,service=pdns_recursor
[10:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:17] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.095 second response time
[10:18:10] (PS2) Alexandros Kosiaris: Add mobileapps to contacts for mobileapps LVS service [puppet] - https://gerrit.wikimedia.org/r/425991 (https://phabricator.wikimedia.org/T189524)
[10:19:34] (PS1) Lokal Profil: Allow format to be overridden in mediatype object [dumps/dcat] - https://gerrit.wikimedia.org/r/425993 (https://phabricator.wikimedia.org/T154914)
[10:20:05] Operations, Traffic: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129314 (Vgutierrez)
[10:20:08] Operations, Traffic: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#3964044 (Vgutierrez)
[10:21:23] (PS1) Vgutierrez: install_server: Reimage maerlant.wikimedia.org as stretch [puppet] - https://gerrit.wikimedia.org/r/425994 (https://phabricator.wikimedia.org/T187090)
[10:23:04] Operations, monitoring, Patch-For-Review, Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4129321 (akosiaris) >>! In T189524#4127591, @bearND wrote: > Same here. Tried with `bearND` but same result. It's `bearnd` (a...
[10:23:06] (PS1) Muehlenhoff: Switch remaining app server and API canaries to stretch [puppet] - https://gerrit.wikimedia.org/r/425995
[10:23:52] https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/181_15l.jpg/640px-181_15l.jpg
[10:24:07] Request from 240a:6b:510:efb8:41a5:bf39:2257:bf1b via cp1063 cp1063, Varnish XID 652598903
[10:24:15] Error: 404, Not Found at Fri, 13 Apr 2018 09:54:36 GMT
[10:24:32] https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/181_15l.jpg/1280px-181_15l.jpg
[10:24:39] this one is OK ^
[10:26:58] should I open a report?
[10:27:04] yannf: both work for me?
[10:27:23] for me as well [10:27:26] RECOVERY - HHVM processes on mw1277 is OK: PROCS OK: 6 processes with command name hhvm [10:27:47] RECOVERY - Disk space on mw1277 is OK: DISK OK [10:27:47] RECOVERY - High CPU load on API appserver on mw1277 is OK: OK - load average: 3.66, 4.15, 2.90 [10:27:47] RECOVERY - dhclient process on mw1277 is OK: PROCS OK: 0 processes with command name dhclient [10:28:05] (03PS1) 10Mobrovac: Refresh mobrovac's SSH keys (step 1/2) [puppet] - 10https://gerrit.wikimedia.org/r/425997 [10:28:06] RECOVERY - nutcracker port on mw1277 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:28:51] I had to purge the page 3 times :/ [10:28:57] now it works [10:29:27] (03PS2) 10Muehlenhoff: Switch remaining app server and API canaries to stretch [puppet] - 10https://gerrit.wikimedia.org/r/425995 [10:31:02] (03CR) 10Muehlenhoff: [C: 032] Switch remaining app server and API canaries to stretch [puppet] - 10https://gerrit.wikimedia.org/r/425995 (owner: 10Muehlenhoff) [10:33:27] (03PS1) 10Alexandros Kosiaris: Deprecate --autoload in uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) [10:33:57] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage maerlant.wikimedia.org as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425994 (https://phabricator.wikimedia.org/T187090) (owner: 10Vgutierrez) [10:34:03] (03PS2) 10Vgutierrez: install_server: Reimage maerlant.wikimedia.org as stretch [puppet] - 10https://gerrit.wikimedia.org/r/425994 (https://phabricator.wikimedia.org/T187090) [10:38:37] (03PS1) 10Volans: puppetboard: notify service on settings change [puppet] - 10https://gerrit.wikimedia.org/r/425999 [10:38:59] !log Depool and reimage maerlant.wikimedia.org as stretch [10:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:11] !log vgutierrez@neodymium conftool action : set/pooled=no; selector: name=maerlant.wikimedia.org,service=pdns_recursor 
[10:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:26] RECOVERY - nutcracker port on mw1276 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:41:36] RECOVERY - HHVM processes on mw1276 is OK: PROCS OK: 6 processes with command name hhvm [10:42:06] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 2.51, 2.29, 2.37 [10:42:06] RECOVERY - Disk space on mw1276 is OK: DISK OK [10:42:06] RECOVERY - dhclient process on mw1276 is OK: PROCS OK: 0 processes with command name dhclient [10:42:51] (03PS2) 10Lokal Profil: Allow format to be overridden in mediatype object [dumps/dcat] - 10https://gerrit.wikimedia.org/r/425993 (https://phabricator.wikimedia.org/T154914) [10:43:45] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129338 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` maerlant.wikimedia.org ``` The log can be found in `/var/log/wmf-aut... [10:45:07] RECOVERY - HHVM rendering on mw1277 is OK: HTTP OK: HTTP/1.1 200 OK - 79409 bytes in 0.244 second response time [10:45:46] PROBLEM - Host 2620:0:862:1:91:198:174:122 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:21] ^ that's maerlant being reimaged [10:47:33] 10Operations, 10Mobile-Content-Service, 10Parsing-Team, 10Reading-Infrastructure-Team-Backlog, and 4 others: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#4129361 (10mobrovac) [10:48:07] PROBLEM - Recursive DNS on 91.198.174.122 is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:49:38] <_joe_> uh? [10:49:43] <_joe_> what's up with recdns? 
[10:49:47] reimaing [10:49:54] *reimaging [10:49:54] <_joe_> heh [10:49:55] <_joe_> ok [10:52:36] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Graphite (or Prometheus) - https://phabricator.wikimedia.org/T131894#4129365 (10Gilles) a:03Gilles [10:52:58] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-Incident: Collect Backend-Timing in Graphite (or Prometheus) - https://phabricator.wikimedia.org/T131894#2182123 (10Gilles) [10:55:00] (03PS8) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [10:55:36] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [10:55:42] (03CR) 10Fdans: Puppetize cron job archiving old MaxMind databases (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [10:58:03] (03PS1) 10ArielGlenn: stop dumps-related cron jobs on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/426003 (https://phabricator.wikimedia.org/T188643) [10:59:54] !log reimaging mw1261-mw1264 to stretch (T174431) [11:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:01] T174431: Upgrade mw* servers to Debian Stretch (using HHVM) - https://phabricator.wikimedia.org/T174431 [11:03:35] (03CR) 10ArielGlenn: [C: 032] stop dumps-related cron jobs on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/426003 (https://phabricator.wikimedia.org/T188643) (owner: 10ArielGlenn) [11:07:05] elukey: the profile that the archive class is included in assumes that /srv exists, shouldn't we make that assumption in the module too? 
[11:07:06] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/statistics/private.pp#L31 [11:10:57] <_joe_> fdans: /srv is part of the base FHS of any linux distro [11:11:02] <_joe_> any sane one [11:11:24] <_joe_> so unless you need to change its default ownership, there is no point to declaring it in puppet [11:13:36] (03PS1) 10Hashar: Introduce tox to setup venv and fix flake8 issues [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426004 [11:13:59] yup... it's out there since FHS-2.3, http://refspecs.linuxfoundation.org/FHS_2.3/fhs-2.3.html#SRVDATAFORSERVICESPROVIDEDBYSYSTEM [11:14:35] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129429 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maerlant.wikimedia.org'] ``` Of which those **FAILED**: ``` ['maerlant.wikimedia.org'] ``` [11:20:56] (03PS2) 10Muehlenhoff: Refresh mobrovac's SSH keys (step 1/2) [puppet] - 10https://gerrit.wikimedia.org/r/425997 (owner: 10Mobrovac) [11:21:35] (03CR) 10Muehlenhoff: [C: 032] Refresh mobrovac's SSH keys (step 1/2) [puppet] - 10https://gerrit.wikimedia.org/r/425997 (owner: 10Mobrovac) [11:26:57] 10Operations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371#4129456 (10MoritzMuehlenhoff) [11:31:51] (03PS1) 10Mobrovac: Refresh mobrovac's SSH keys (step 2/2) [puppet] - 10https://gerrit.wikimedia.org/r/426007 [11:31:53] PROBLEM - Check size of conntrack table on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:31:53] PROBLEM - dhclient process on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:31:53] PROBLEM - HHVM processes on mw1278 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:31:53] PROBLEM - mediawiki-installation DSH group on mw1278 is CRITICAL: Host mw1278 is not in mediawiki-installation dsh group [11:32:42] (03CR) 10Mobrovac: [C: 04-1] "Let's wait Monday to get this in." [puppet] - 10https://gerrit.wikimedia.org/r/426007 (owner: 10Mobrovac) [11:33:13] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:33:33] PROBLEM - HHVM rendering on mw1278 is CRITICAL: connect to address 10.64.0.73 and port 80: Connection refused [11:33:33] PROBLEM - nutcracker port on mw1278 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:33:34] PROBLEM - Recursive DNS on 2620:0:862:1:a6ba:dbff:fe30:d112 is CRITICAL: CRITICAL - Plugin timed out while executing system call [11:35:13] PROBLEM - nutcracker process on mw1278 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:14] PROBLEM - Recursive DNS on 91.198.174.122 is CRITICAL: CRITICAL - Plugin timed out while executing system call [11:35:33] RECOVERY - Recursive DNS on 2620:0:862:1:a6ba:dbff:fe30:d112 is OK: DNS OK: 0.123 seconds response time. www.wikipedia.org returns 208.80.154.224 [11:35:33] PROBLEM - Check systemd state on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:53] PROBLEM - dhclient process on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:35:54] PROBLEM - Check size of conntrack table on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:36:03] RECOVERY - Recursive DNS on 91.198.174.122 is OK: DNS OK: 0.091 seconds response time. www.wikipedia.org returns 208.80.154.224 [11:36:13] PROBLEM - Check whether ferm is active by checking the default input chain on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:36:53] PROBLEM - DPKG on maerlant is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:36:54] PROBLEM - puppet last run on mw1278 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:37:53] RECOVERY - Check size of conntrack table on maerlant is OK: OK: nf_conntrack is 0 % full [11:37:54] RECOVERY - dhclient process on maerlant is OK: PROCS OK: 0 processes with command name dhclient [11:37:54] RECOVERY - DPKG on maerlant is OK: All packages OK [11:38:13] RECOVERY - Check whether ferm is active by checking the default input chain on maerlant is OK: OK ferm input default policy is set [11:38:24] PROBLEM - Host 2620:0:862:1:a6ba:dbff:fe30:d112 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:23] PROBLEM - Host 2620:0:862:1:a6ba:dbff:fe30:d112 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:34] RECOVERY - Check systemd state on maerlant is OK: OK - running: The system is fully operational [11:50:53] (03CR) 10Jcrespo: [V: 032 C: 032] Introduce tox to setup venv and fix flake8 issues [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426004 (owner: 10Hashar) [11:51:27] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es1013 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425992 (owner: 10Jcrespo) [11:51:50] (03PS4) 10Muehlenhoff: mediawiki::packages::fonts: Consistently use require_package [puppet] - 10https://gerrit.wikimedia.org/r/420670 [11:52:54] (03Merged) 10jenkins-bot: mariadb: Repool es1013 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425992 (owner: 10Jcrespo) [11:53:12] (03CR) 10jenkins-bot: mariadb: Repool es1013 with low weight after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/425992 (owner: 10Jcrespo) [11:56:47] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1013 with low load (duration: 01m 04s) [11:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:37] !log vgutierrez@neodymium conftool action : set/pooled=yes; selector: name=maerlant.wikimedia.org,service=pdns_recursor 
[11:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:14] RECOVERY - HHVM processes on mw1278 is OK: PROCS OK: 6 processes with command name hhvm [12:04:34] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 5.39, 5.45, 3.59 [12:08:29] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 79387 bytes in 8.351 second response time [12:10:38] RECOVERY - nutcracker process on mw1278 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:12:45] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 85550.07 seconds [12:15:16] (03PS2) 10Alexandros Kosiaris: Deprecate --autoload in uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) [12:15:18] (03PS1) 10Alexandros Kosiaris: coal: Remove redundant uwsgi::app parameter [puppet] - 10https://gerrit.wikimedia.org/r/426010 (https://phabricator.wikimedia.org/T192102) [12:15:20] (03PS1) 10Alexandros Kosiaris: encapi: Remove redundant uwsgi::app parameter [puppet] - 10https://gerrit.wikimedia.org/r/426011 (https://phabricator.wikimedia.org/T192102) [12:15:22] (03PS1) 10Alexandros Kosiaris: dynamicproxy: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426012 (https://phabricator.wikimedia.org/T192102) [12:15:24] (03PS1) 10Alexandros Kosiaris: graphite::web: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426013 (https://phabricator.wikimedia.org/T192102) [12:15:26] (03PS1) 10Alexandros Kosiaris: ifttt: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426014 (https://phabricator.wikimedia.org/T192102) [12:15:28] (03PS1) 10Alexandros Kosiaris: quarry::web: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426015 (https://phabricator.wikimedia.org/T192102) [12:15:30] (03PS1) 10Alexandros Kosiaris: service::uwsgi: Stop using 
--autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426016 (https://phabricator.wikimedia.org/T192102) [12:15:32] (03PS1) 10Alexandros Kosiaris: wikilabels::web: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426017 (https://phabricator.wikimedia.org/T192102) [12:15:35] (03PS1) 10Alexandros Kosiaris: wikimetrics::web: Stop using --autoload in uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/426018 (https://phabricator.wikimedia.org/T192102) [12:16:11] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 79385 bytes in 0.144 second response time [12:17:01] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:18:00] RECOVERY - nutcracker port on mw1278 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:18:12] dbstore1002 is an ongoing schema change, acking [12:24:18] 10Operations, 10Puppet, 10Patch-For-Review: deprecate and remove --autoload in uwsgi puppet class - https://phabricator.wikimedia.org/T192102#4129510 (10akosiaris) [12:29:52] 10Operations, 10Traffic, 10Patch-For-Review: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#4129515 (10Vgutierrez) [12:31:54] RECOVERY - mediawiki-installation DSH group on mw1278 is OK: OK [12:36:15] !log installing apache security updates on bohrium (piwik) [12:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:15] !log installing apache security updates on mendelevium (otrs) [12:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:15] (03CR) 10Hashar: "CI added with https://gerrit.wikimedia.org/r/#/c/426023/" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426004 (owner: 10Hashar) [12:39:14] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1973 bytes in 0.114 second response time 
[12:44:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1972 bytes in 0.102 second response time [12:45:23] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4129546 (10Gehel) Similar to what we have for other services (eg: [[ https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L93-L97 | elasticsear... [12:46:00] (03PS9) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [12:46:10] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426024 [12:47:16] (03CR) 10Hashar: "recheck" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426004 (owner: 10Hashar) [12:47:24] (03PS3) 10Gehel: maps: run populate_admin() regularly [puppet] - 10https://gerrit.wikimedia.org/r/425524 (https://phabricator.wikimedia.org/T190605) [12:47:26] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/426024 (owner: 10Hashar) [12:47:40] (03CR) 10Elukey: Puppetize cron job archiving old MaxMind databases (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [12:54:15] (03PS10) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [12:54:22] (03CR) 10Fdans: Puppetize cron job archiving old MaxMind databases (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [12:54:46] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 
(https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [12:57:37] (03PS11) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [12:59:08] RECOVERY - mediawiki-installation DSH group on mw1276 is OK: OK [12:59:08] RECOVERY - mediawiki-installation DSH group on mw1277 is OK: OK [13:01:00] fdans: our puppet style guide states indentation is 4 spaces: https://wikitech.wikimedia.org/wiki/Puppet_coding#Spacing,_Indentation,_&_Whitespace [13:02:07] PROBLEM - DPKG on mw1261 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:02:08] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group [13:02:08] PROBLEM - dhclient process on mw1261 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:02:08] PROBLEM - Disk space on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:02:08] PROBLEM - HHVM processes on mw1263 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:02:08] PROBLEM - nutcracker port on mw1263 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:02:48] (03PS12) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [13:03:10] ^ reimage race, silencing [13:03:20] (03CR) 10jerkins-bot: [V: 04-1] Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [13:03:48] PROBLEM - HHVM rendering on mw1263 is CRITICAL: connect to address 10.64.0.58 and port 80: Connection refused [13:03:48] PROBLEM - mediawiki-installation DSH group on mw1261 is CRITICAL: Host mw1261 is not in mediawiki-installation dsh group [13:03:48] PROBLEM - Disk space on mw1261 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:48] PROBLEM - HHVM processes on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:48] PROBLEM - nutcracker port on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:03:48] PROBLEM - nutcracker process on mw1263 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:04:31] fdans: may I suggest some style changes, beyond spacing? [13:04:47] sure! [13:05:28] PROBLEM - HHVM rendering on mw1262 is CRITICAL: connect to address 10.64.0.57 and port 80: Connection refused [13:05:28] PROBLEM - HHVM processes on mw1261 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:05:28] PROBLEM - nutcracker port on mw1261 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:05:28] PROBLEM - nutcracker process on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:05:37] PROBLEM - puppet last run on mw1263 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:18:59] !log increasing heap size to 16G -- T186751 [13:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:06] T186751: Reset RESTBase dev environment - https://phabricator.wikimedia.org/T186751 [13:21:57] RECOVERY - Disk space on mw1261 is OK: DISK OK [13:22:14] (03PS1) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:22:17] RECOVERY - dhclient process on mw1261 is OK: PROCS OK: 0 processes with command name dhclient [13:22:17] RECOVERY - DPKG on mw1261 is OK: All packages OK [13:22:18] RECOVERY - HHVM processes on mw1263 is OK: PROCS OK: 6 processes with command name hhvm [13:22:37] RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 1 process with command name hhvm [13:22:39] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:23:14] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426028 [13:23:22] * moritzm shakes fist at the pointless commit message CI check [13:23:23] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426028 [13:23:58] RECOVERY - HHVM processes on mw1262 is OK: PROCS OK: 6 processes with command name hhvm [13:24:17] RECOVERY - Disk space on mw1262 is OK: DISK OK [13:24:29] (03PS2) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:24:56] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:25:08] (03CR) 10Elukey: [C: 031] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:25:24] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426028 (owner: 10Marostegui) [13:26:43] (03PS3) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:26:48] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426028 (owner: 10Marostegui) [13:27:10] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:27:17] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1947 bytes in 0.108 second response time [13:28:07] RECOVERY - nutcracker process on mw1263 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:28:26] RECOVERY - nutcracker port on mw1263 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:28:28] (03PS4) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:28:52] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:29:26] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426028 (owner: 10Marostegui) [13:29:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1104 after alter table (duration: 01m 02s) [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:24] (03PS1) 10Jcrespo: mariadb: Return es1013 to be fully pooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426030 [13:30:33] (03PS2) 10Jcrespo: mariadb: Return es1013 to be fully pooled [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/426030 [13:30:37] RECOVERY - nutcracker port on mw1261 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:31:07] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 79431 bytes in 7.820 second response time [13:32:04] !log restart druid and zookeeper daemons on druid100[456] for opejdk-7 updates [13:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:24] (03PS5) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:32:47] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:33:46] RECOVERY - nutcracker process on mw1262 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:33:56] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 79431 bytes in 7.276 second response time [13:34:06] RECOVERY - nutcracker port on mw1262 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:35:27] (03CR) 10Jcrespo: [C: 032] mariadb: Return es1013 to be fully pooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426030 (owner: 10Jcrespo) [13:35:36] RECOVERY - puppet last run on mw1263 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:35:44] (03CR) 10jenkins-bot: mariadb: Return es1013 to be fully pooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426030 (owner: 10Jcrespo) [13:36:52] (03PS6) 10Muehlenhoff: Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 [13:37:19] (03CR) 10jerkins-bot: [V: 04-1] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:38:19] (03CR) 10Muehlenhoff: [V: 032 C: 032] Add ivy-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426027 (owner: 10Muehlenhoff) [13:42:17] 
RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1974 bytes in 0.115 second response time [13:42:50] (03PS13) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) [13:48:08] (03PS1) 10Muehlenhoff: Add maven-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426034 [13:48:51] (03CR) 10Muehlenhoff: [C: 032] Add maven-debian-helper to package list [puppet] - 10https://gerrit.wikimedia.org/r/426034 (owner: 10Muehlenhoff) [13:49:32] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1013 with full weight (duration: 01m 00s) [13:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:24] !log roll restart druid + zookeeper daemons on druid100[123] for openjdk-7 updates [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:06] (03PS1) 10Andrew Bogott: Horizon: put in maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/426042 (https://phabricator.wikimedia.org/T145919) [14:06:26] !log uploaded ivy-debian-helper to apt.wikimedia.org/jessie (needed for zookeeper backport) [14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:55] (03PS1) 10Andrew Bogott: Openstack version -> Mitaka almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/426044 (https://phabricator.wikimedia.org/T145919) [14:09:34] !log silencing alerts for labcontrol*, labnet*, labservices*, labvirt* before beginning T145919 [14:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:40] T145919: Upgrade Labs to OpenStack Mitaka - https://phabricator.wikimedia.org/T145919 [14:10:58] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/425976 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:12:45] (03PS1) 
10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426047 (https://phabricator.wikimedia.org/T191996) [14:13:08] !log installing apache security updates on contint1001 [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:16] !log disabling puppet on labcontrol*, labnet*, labservices*, labvirt* before beginning T145919 [14:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426047 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:14:28] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10922/ says it's fine so merging" [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) (owner: 10Alexandros Kosiaris) [14:14:34] (03PS3) 10Alexandros Kosiaris: Deprecate --autoload in uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) [14:15:19] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426047 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:15:27] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.38, 35.47, 30.51 [14:15:42] (03CR) 10Andrew Bogott: [C: 032] Horizon: put in maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/426042 (https://phabricator.wikimedia.org/T145919) (owner: 10Andrew Bogott) [14:15:49] (03PS4) 10Alexandros Kosiaris: Deprecate --autoload in uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) [14:15:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Deprecate --autoload in uwsgi::app [puppet] - 10https://gerrit.wikimedia.org/r/425998 (https://phabricator.wikimedia.org/T192102) (owner: 
10Alexandros Kosiaris) [14:16:30] andrewbogott: I 've merged yours as well [14:16:37] akosiaris: thanks [14:16:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1104 - T191996 (duration: 00m 59s) [14:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:47] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:19:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426047 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:22:40] !log enable flow control on db1114's switch port - T191996 [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:46] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:26:25] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426050 [14:26:42] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426050 (owner: 10Marostegui) [14:26:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426050 (owner: 10Marostegui) [14:28:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1104 - T191996 (duration: 01m 07s) [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:50] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [14:28:59] 10Operations, 10DBA, 10netops: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129750 (10jcrespo) Adding the tag to reflect work done at network layer. 
[14:29:38] (03PS1) 10Marostegui: db-eqiad.php: db1114 increase weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426051 (https://phabricator.wikimedia.org/T191996) [14:30:19] (03CR) 10Marostegui: [C: 04-1] "wait a bit until traffic stabilizes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426051 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [14:31:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1114" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426050 (owner: 10Marostegui) [14:34:46] (03PS2) 10Andrew Bogott: Openstack version -> Mitaka almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/426044 (https://phabricator.wikimedia.org/T145919) [14:34:47] (03PS1) 10Andrew Bogott: Horizon: fix maintenance mode setting to be 'main' v. 'labtest' [puppet] - 10https://gerrit.wikimedia.org/r/426053 (https://phabricator.wikimedia.org/T145919) [14:34:59] (03CR) 10Andrew Bogott: [C: 032] Horizon: fix maintenance mode setting to be 'main' v. 
'labtest' [puppet] - 10https://gerrit.wikimedia.org/r/426053 (https://phabricator.wikimedia.org/T145919) (owner: 10Andrew Bogott) [14:36:54] (03CR) 10Andrew Bogott: [C: 032] Openstack version -> Mitaka almost everywhere [puppet] - 10https://gerrit.wikimedia.org/r/426044 (https://phabricator.wikimedia.org/T145919) (owner: 10Andrew Bogott) [14:37:28] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.33, 35.70, 31.96 [14:38:08] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) [14:38:40] !log stopping puppet and nodepool on labnodepool1001 [14:38:44] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:14] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for zuul-merger [puppet] - 10https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) [14:41:28] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[python3-novaclient] [14:42:23] !log upgrading lots of packages on labcontrol1001 and 1002 and rebooting. 
T145919 [14:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:29] T145919: Upgrade Labs to OpenStack Mitaka - https://phabricator.wikimedia.org/T145919 [14:45:28] PROBLEM - DPKG on labcontrol1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:45:47] PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:46:33] PROBLEM - keystone admin endpoint port 35357 on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 35357: Connection refused [14:46:40] PROBLEM - keystone public endoint port 5000 on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 5000: Connection refused [14:46:40] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:46:52] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [14:47:38] RECOVERY - keystone admin endpoint port 35357 on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.006 second response time [14:47:47] RECOVERY - keystone public endoint port 5000 on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.003 second response time [14:47:47] RECOVERY - DPKG on labcontrol1002 is OK: All packages OK [14:47:48] andrewbogott: keystone pages, is that due to labcontrol reboots? [14:48:01] ema: yes. I dowtimed that host, I don't know why it's still alerting [14:48:05] Guess I'll downtime it again! [14:48:10] the recoveries are coming in now [14:49:22] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 531 bytes in 0.007 second response time [14:59:33] !log rebooting labcontrol1001 [15:00:52] Hey there. Could I ask someone to merge https://gerrit.wikimedia.org/r/#/c/424595/ -- it's a change to the navtiming.py script that Performance deploys via puppet. 
[15:01:46] I'm trying to move the script out of puppet, and am trying to clear outstanding changesets before doing so. [15:01:51] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:02] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:02:11] RECOVERY - mediawiki-installation DSH group on mw1262 is OK: OK [15:03:51] RECOVERY - mediawiki-installation DSH group on mw1261 is OK: OK [15:04:12] marlier: it's not a good time now, see Andrew's mail to wikitech-l, they're upgrading OpenStack and that also affects CI [15:04:45] Ah, sorry, I didn't realize that was actively ongoing. [15:05:54] we are working in some NFS issues in toolforge, in case you see some failing tools or related stuff [15:06:31] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 0.021 second response time [15:08:06] 10Operations, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129814 (10ayounsi) ```name=db1114 ethtool eno1 Supported pause frame use: No Advertised pause frame use: Symmetric Link partner advertised pause frame use: No ``` ```name=db1114's switch... [15:09:01] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:16] !log labstore1004 stop nfs-exportd, cp export.bak to export.d, exportfs -ra (all exports were wiped out) [15:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:23] 10Operations, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4129816 (10Marostegui) @ayounsi thanks for your help. 
If you want to compare it with the other two servers that receive exactly the same traffic, those are: db1066 and db1080. [15:09:31] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is inactive [15:09:43] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4129817 (10bearND) >>! In T189524#4129321, @akosiaris wrote: > It's `bearnd` (all lowercase), per https://phabricator.wikimedia.... [15:09:51] marlier: I could merge that for you later on today or early next week depending on when the maintenance is complete and CI back to normal [15:10:24] That would be great, herron. It's a totally safe change, so any time is fine. [15:10:50] kk cool will do [15:11:31] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:14:33] !log wiki replicas: added page_assessments views for frwiki & huwiki [15:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:05] musikanimal: ^^ both page_assessments tables should be live for frwiki & huwiki [15:16:48] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4129825 (10akosiaris) >>! In T189524#4129817, @bearND wrote: >>>! In T189524#4129321, @akosiaris wrote: >> It's `bearnd` (all lo... 
[15:18:47] <_joe_> the puppet CI is untouched by nodepool not being available btw [15:18:48] (03PS1) 10Arturo Borrero Gonzalez: wmcs: labstore1004 and labstore1005 should keep using liberty [puppet] - 10https://gerrit.wikimedia.org/r/426082 (https://phabricator.wikimedia.org/T145919) [15:18:52] <_joe_> or it should be [15:19:04] <_joe_> so ian's patch should be mergeable IMHO [15:20:42] I have a few if you want me to stuff it through [15:21:03] (03CR) 10Arturo Borrero Gonzalez: "Checking catalog compilation: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10924/" [puppet] - 10https://gerrit.wikimedia.org/r/426082 (https://phabricator.wikimedia.org/T145919) (owner: 10Arturo Borrero Gonzalez) [15:23:03] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129858 (10Nuria) Could we restore proxies now that the nice opera folks gave us their list? Clearly we also need to look into why/ho... 
[15:24:12] (03CR) 10Arturo Borrero Gonzalez: "> Checking catalog compilation: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/10924/" [puppet] - 10https://gerrit.wikimedia.org/r/426082 (https://phabricator.wikimedia.org/T145919) (owner: 10Arturo Borrero Gonzalez) [15:24:16] (03CR) 10Arturo Borrero Gonzalez: [C: 032] wmcs: labstore1004 and labstore1005 should keep using liberty [puppet] - 10https://gerrit.wikimedia.org/r/426082 (https://phabricator.wikimedia.org/T145919) (owner: 10Arturo Borrero Gonzalez) [15:24:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few comments inline, scripts LGTM otherwise" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [15:26:05] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129878 (10BBlack) Yeah, ema and I discussed this after the meeting the other day. I'm not sure whether or how we can look into the... [15:26:42] (03PS4) 10BBlack: navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - 10https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [15:27:12] !log rebooting lots of packages on labnet1001 and labnet1002 for T145919 [15:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] T145919: Upgrade Labs to OpenStack Mitaka - https://phabricator.wikimedia.org/T145919 [15:27:19] (03CR) 10BBlack: [C: 032] navtiming: Remove broken 'rendering' and 'pageSpeed' metrics [puppet] - 10https://gerrit.wikimedia.org/r/424595 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [15:27:39] marlier: merged-up! 
[15:27:52] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4129884 (10Nuria) >we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-man... [15:27:54] Awesome, thanks much bblack [15:29:01] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:30:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A single inline comment. rest LGTM" (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [15:31:52] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [15:32:26] (03CR) 10Alexandros Kosiaris: "I am not entirely sure about this approach. Django has its own testing framework. Why exactly do we avoid that and use pytest ?" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [15:32:53] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4129893 (10Mholloway) I was able to add a (non-persistent) test comment to `Mobileapps LVS eqiad` as well. [15:33:28] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4129902 (10mobrovac) [15:33:33] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (done): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4129898 (10mobrovac) 05stalled>03Resolved a:03mobrovac We have decided to put it on Ganeti for now, so I'm resolving this task. 
@Niedzielski I... [15:34:31] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.137 second response time [15:35:15] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (done): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4129916 (10pmiazga) @mobrovac that's a great news. [15:42:01] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [15:43:46] !log restarting nodepool on labnodepool1001 [15:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:08] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (done): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4129941 (10Niedzielski) Thanks so much @mobrovac!! Feel free to do whatever you want to the patches and let us know if help is needed. Thanks again! [15:49:59] (03CR) 10Krinkle: [C: 031] "Ready for deploy." [puppet] - 10https://gerrit.wikimedia.org/r/420831 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [15:50:13] bblack: Could you roll this one out as well? ^ [15:50:20] !log upgrading lots of packages and rebooting labservices1002 and 1002 [15:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:58] Krinkle: if I merge https://gerrit.wikimedia.org/r/#/c/425858/ could you test it? [15:54:57] Sure [15:55:16] (03PS3) 10Jcrespo: beta: Combine commons, deployments, meta and zero vhost (2) [puppet] - 10https://gerrit.wikimedia.org/r/425858 [15:55:16] jynus: What broke last time? 
[15:55:22] Krinkle: nothing [15:55:28] beta was broken [15:55:30] Okay [15:55:46] so I didn't want to add to the broken-ness [15:55:49] jynus: I'm here btw [15:55:52] ah [15:56:06] well, only 1 of you is enough [15:56:09] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4129948 (10bearND) @akosiaris Got it and deleted it. Thanks! [15:56:09] :-) [15:56:11] :) [15:56:13] you can fight [15:56:31] (03CR) 10Jcrespo: [C: 032] beta: Combine commons, deployments, meta and zero vhost (2) [puppet] - 10https://gerrit.wikimedia.org/r/425858 (owner: 10Jcrespo) [15:57:15] jynus: ^^^^ yay <3 [15:57:25] not me [15:57:30] If you've got itches about that, I got patches <3 [15:57:32] I think it was eddiegp's work [15:57:42] I'll just have my eyes open whether it works as well. Two people doing that can't hurt ;) [15:57:49] Yep, that was me. [15:57:55] so all credits to him [15:58:03] I just push a key [15:58:19] you can pull now/run puppet [15:58:33] https://gerrit.wikimedia.org/r/#/c/422571/ https://gerrit.wikimedia.org/r/#/c/421949/ https://gerrit.wikimedia.org/r/#/c/402090/ [15:58:41] All need some love :) [15:59:25] I am going to guess those now conflict? [15:59:50] (03PS3) 10Chad: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 [15:59:50] I attended eddiegp's one first because it was on puppet swat [15:59:59] That one cleanly rebased [16:00:09] jynus: Is it applied on beta? [16:00:10] (03PS2) 10Chad: Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 [16:00:20] Krinkle: I just merged on production puppet [16:00:26] As did that one (which is the easiest one) [16:00:39] Third one probably conflicts. 
[16:00:45] (03PS3) 10Chad: Move wiktionary and foundationwiki docroots to standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) [16:00:51] Oh wow, it rebased nicely [16:01:02] no_justification: ok, let's test this first one, which was due for yesterday [16:01:37] Running puppet on the beta appservers [16:01:40] I may tell you to schedule for tuesday, as I was already about to leave [16:02:17] It's applied [16:02:50] beta still working, all affected sites? [16:03:53] mmm, I get some weird redirects [16:04:07] !log cleaning up lost instances in nodepool (nodepool delete XXXXX) [16:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:33] (03PS1) 10Andrew Bogott: designate: add a pool config for the main deploy [puppet] - 10https://gerrit.wikimedia.org/r/426100 (https://phabricator.wikimedia.org/T187954) [16:05:35] but most likely those are not apache, but mediawiki tests [16:06:00] (03CR) 10jerkins-bot: [V: 04-1] designate: add a pool config for the main deploy [puppet] - 10https://gerrit.wikimedia.org/r/426100 (https://phabricator.wikimedia.org/T187954) (owner: 10Andrew Bogott) [16:06:03] Yeah, something looks off. [16:06:17] mmm [16:06:26] zero and meta mixed, maybe? [16:07:13] I am sorry, that is why I asked for help on testing- I am not too familiar with beta to understand if that is normal [16:07:33] (03PS2) 10Andrew Bogott: designate: add a pool config for the main deploy [puppet] - 10https://gerrit.wikimedia.org/r/426100 (https://phabricator.wikimedia.org/T187954) [16:07:53] The difference is that zero.wikimedia... is now the ServerName. [16:07:59] And the others are just ServerAlias [16:08:11] no, the change and apache I know [16:08:31] It's not normal. [16:08:31] it is what beta is supposed to do what I am unfamiliar [16:08:43] eddiegp: I am going to revert, ok? [16:08:49] get the logs you need [16:08:54] Yeah, I think we should do that. 
[16:09:03] which should be enough with my browsing and yours [16:09:09] and I am reverting [16:09:12] sorry again [16:09:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.080 second response time [16:09:42] (03CR) 10Andrew Bogott: [C: 032] designate: add a pool config for the main deploy [puppet] - 10https://gerrit.wikimedia.org/r/426100 (https://phabricator.wikimedia.org/T187954) (owner: 10Andrew Bogott) [16:10:06] as the wise Giuseppe once said "apache config is not easy and it is very easy to break stuff" [16:10:14] Not your fault ;) [16:10:41] I think something down the line somehow fetches the ServerName as a variable and does some branching based on it. [16:10:46] eddiegp: my suggestion as this is not trivial is to create a task [16:10:59] refer to your patch and no_justification ones [16:11:02] and work there [16:11:08] I may help, but not today [16:11:19] Yeah, sure. [16:11:27] (03PS1) 10Bstorm: dotfiles: add bstorm dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/426101 [16:11:47] (03PS1) 10Jcrespo: Revert "beta: Combine commons, deployments, meta and zero vhost (2)" [puppet] - 10https://gerrit.wikimedia.org/r/426102 [16:11:56] I don't see an issue on beta. [16:11:57] (03PS2) 10Jcrespo: Revert "beta: Combine commons, deployments, meta and zero vhost (2)" [puppet] - 10https://gerrit.wikimedia.org/r/426102 [16:12:00] What is mixed up? [16:12:12] Krinkle: Go to commons.beta. [16:12:16] Or meta.beta. [16:12:20] 'Welcome to Wikimedia Commons BETA,' [16:12:28] 'Welcome to the Meta Wiki' [16:12:30] (03CR) 10Bstorm: [C: 032] dotfiles: add bstorm dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/426101 (owner: 10Bstorm) [16:12:30] Yes [16:12:48] so you are ok with reverting or not? [16:12:50] If I go to http://commons.wikimedia.beta.wmflabs.org/ [16:12:58] I'm redirected to https://zero.wikimedia.beta.wmflabs.org/wiki/Special:ZeroPortal [16:13:11] Doesn't happen for me. 
[16:13:23] What I do see is that the logo is wrong, which could be due to something else perhaps [16:13:24] jynus: Revert [16:13:26] could it be some test on zero at mediawiki level only? [16:13:33] It's varnish [16:13:41] I turned on X-Wikimedia-Debug [16:13:50] Ah, good point [16:13:51] which, also in beta, will bring you directly to the appservers [16:13:55] Yeah, they're all zerowiki now [16:13:56] OK [16:14:07] ok, so 2 votes for revert? [16:14:07] The working version is still cached, but the appservers do the wrong thing [16:14:09] So revert [16:14:11] Yep [16:14:15] ok, doing [16:14:24] cache is confusing :-) [16:14:46] (03CR) 10Jcrespo: [C: 032] Revert "beta: Combine commons, deployments, meta and zero vhost (2)" [puppet] - 10https://gerrit.wikimedia.org/r/426102 (owner: 10Jcrespo) [16:14:56] (03PS3) 10Jcrespo: Revert "beta: Combine commons, deployments, meta and zero vhost (2)" [puppet] - 10https://gerrit.wikimedia.org/r/426102 [16:15:26] (03CR) 10Hashar: [C: 031] "zuul-merger can be restarted anytime (unlike the zuul daemon)." [puppet] - 10https://gerrit.wikimedia.org/r/426055 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:15:27] eddiegp: production uses a non-realistic ServerName with all real ones in ServerAlias [16:15:42] e.g. 'ServerName wiktionary ServerAlias *.wiktionary.org' [16:15:58] Maybe next time we can try that here too, e.g. make ServerName misc-domains and then the others as ServerAlias [16:16:07] cache is always confusing, that's why I find myself so often confused! :) [16:16:09] I'm confused as to why it would affect the Host header sent to php though [16:16:25] bblack: go to sleep! :-) [16:16:37] I am Soon now, just tidying up a few bits first! [16:16:42] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. 
Failed resources (up to 3 shown): File[/home/bstorm/.bashrc],File[/home/bstorm/.screenrc],File[/home/bstorm/.vimrc] [16:16:59] bblack must be sleeping already. It is just the cache being flushed to the irc channel [16:17:03] eddiegp: I just updated production and head [16:17:41] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/bstorm/.bashrc] [16:17:54] puppet has run on the appservers [16:18:06] no_justification having your brain power for helping with that going forward^ would also be appreciated, including your patches [16:18:28] (in meeting, one moment) [16:18:45] Krinkle: I presume it's SERVER_NAME used somewhere in php. [16:19:01] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/home/bstorm/.bash_profile],File[/home/bstorm/.bashrc] [16:19:11] That one *might* always be set to the canonical server name, not necessarily what was sent by the client [16:19:13] things seem back to normal for me [16:19:24] so I will disconnect now [16:19:33] Yes, for me too. Sorry, forgot to say that. [16:19:39] eddiegp: If that were true, production wouldn't be working and we'd be seeing "Unknown wiki" on all wiki domains. [16:19:45] jynus: Thanks! Have a nice weekend. [16:19:47] I think multiversion does something with servername [16:19:49] eddiegp: we use SERVER_NAME to determine which wiki to initialise. [16:19:52] same, bye! [16:19:53] Yes. [16:20:17] Krinkle: Let me rephrase that - I think you can configure apache to do both. [16:20:28] And that configuration might differ between beta and prod [16:20:53] Right [16:21:10] There might be a setting somewhere to enforce normalisation that we have on beta and off in prod [16:21:11] actually.. 
[16:22:07] https://secure.php.net/manual/en/reserved.variables.server.php says UseCanonicalName [16:22:15] UseCanonicalName off [16:22:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: db1114 increase weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426051 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:22:21] Is the default [16:22:47] (03PS1) 10Arturo Borrero Gonzalez: labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - 10https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919) [16:23:07] There's a bunch of matches for 'UseCanonicalName off' in prod [16:23:17] and one match that turns it on [16:23:18] https://github.com/wikimedia/puppet/blob/e959321aa620b77403cc9379db2e86080323c6e8/modules/mediawiki/templates/apache/apache2.conf.erb#L55 [16:23:26] So I guess we do opt-out for all wiki-related domains [16:23:32] But we forgot to add it to this one [16:23:49] (03Merged) 10jenkins-bot: db-eqiad.php: db1114 increase weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426051 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:24:25] eddiegp: Wanna draft the commit for next time? [16:24:40] 10Operations, 10Ops-Access-Requests: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130003 (10pmiazga) [16:24:40] Yeah, I'll do that. 
[16:24:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1104 some main traffic - T191996 (duration: 01m 00s) [16:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:04] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [16:26:30] !log disable puppet in labstore1005 to hot-test https://gerrit.wikimedia.org/r/#/c/426103/ [16:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:43] 10Operations, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130018 (10ayounsi) 1/ Flow-control not helping, reverted 2/ Are the other servers seeing the same bursts of inbound sessions? 3/ The `ifconfig` input drop counter matches the nic stats... [16:27:08] 10Operations, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130019 (10Marostegui) So given that db1066 and db1080 have the same traffic than db1114 (and even more when db1114 gets depooled from API) and they don't suffer any kind of issues, could... [16:27:28] (03PS1) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 [16:28:48] Also, I should ensure => absent the config file I'm deleting... [16:29:23] (03CR) 10jenkins-bot: db-eqiad.php: db1114 increase weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426051 (https://phabricator.wikimedia.org/T191996) (owner: 10Marostegui) [16:34:15] 10Operations, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130039 (10Marostegui) >>! In T191996#4130018, @ayounsi wrote: > 1/ Flow-control not helping, reverted > Cool > 2/ Are the other servers seeing the same bursts of inbound sessions? Th... 
[16:34:53] !log upgrading packages on labvirt1016 and rebooting (1016 is a spare server that won't affect VPS users) [16:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:00] (03PS2) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 [16:38:47] (03PS1) 10Marostegui: db-eqiad.php: Restoring db1114 main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426108 [16:39:29] (03PS3) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 [16:40:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restoring db1114 main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426108 (owner: 10Marostegui) [16:41:21] Krinkle ^ looks good? [16:41:23] (03Merged) 10jenkins-bot: db-eqiad.php: Restoring db1114 main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426108 (owner: 10Marostegui) [16:41:40] (03CR) 10jenkins-bot: db-eqiad.php: Restoring db1114 main traffic weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/426108 (owner: 10Marostegui) [16:42:33] I'd test it again by cherry-picking it. Should have done that before signing it up for puppet swat to begin with... 
[16:42:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1104 origina main traffic weight (duration: 01m 00s) [16:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:05] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10Patch-For-Review: db1114 connection issues - https://phabricator.wikimedia.org/T191996#4130046 (10Marostegui) [16:44:01] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:45:35] (03PS4) 10Krinkle: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:46:21] (03CR) 10Krinkle: beta: Combine commons, deployments, meta and zero vhost (3) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:46:41] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:47:41] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:47:47] (03PS1) 10Imarlier: webperf and coal: add scap_target stanzas [puppet] - 10https://gerrit.wikimedia.org/r/426112 (https://phabricator.wikimedia.org/T191994) [16:48:10] (03PS5) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost (3) [puppet] - 10https://gerrit.wikimedia.org/r/426104 [16:50:48] (03CR) 10EddieGP: beta: Combine commons, deployments, meta and zero vhost (3) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [16:51:21] (03PS1) 10Andrew Bogott: Revert "Horizon: put in maintenance mode" [puppet] - 10https://gerrit.wikimedia.org/r/426113 [16:51:57] (03CR) 10Andrew Bogott: [C: 032] Revert "Horizon: put in maintenance mode" [puppet] - 10https://gerrit.wikimedia.org/r/426113 (owner: 10Andrew Bogott) [16:55:06] !log enable puppet in labstore1005 [16:55:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:33] (03CR) 10Madhuvishy: [C: 031] labstore: nfs-exportd: prevent flushing all exports due to errors [puppet] - 10https://gerrit.wikimedia.org/r/426103 (https://phabricator.wikimedia.org/T145919) (owner: 10Arturo Borrero Gonzalez) [16:59:18] (03CR) 10Dzahn: "yea, so i wasn't sure if i want to remove it for all or keep it for all. i wanted to be consistent though and have removed it from 2 other" [puppet] - 10https://gerrit.wikimedia.org/r/425945 (owner: 10Dzahn) [16:59:35] (03CR) 10EddieGP: "cherry-picked on deployment-puppetmaster02. Seems to work nicely." [puppet] - 10https://gerrit.wikimedia.org/r/426104 (owner: 10EddieGP) [17:00:58] 10Operations, 10Ops-Access-Requests: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130101 (10pmiazga) [17:08:09] (03PS2) 10Dzahn: installserver: convert nested roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/425945 [17:09:23] (03CR) 1020after4: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/426112 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:12:11] fdans: i noticed you are compiling a puppet change. fyi, if you leave the "list of nodes" field blank it will try compiling that on a LOT of instances and it will take a long time (and block subsequent compiles waiting in line). if you limit that to stat1005 or the actual host.. it will be done in no-time. [17:13:14] it tries to run it on one node per each "node" line in site.pp afaict [17:14:06] 10Operations, 10Ops-Access-Requests: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130118 (10Jdlrobson) Piotr has been with us over a year now and we lean on him heavily for anything backend related. I've been impressed by his ability and care with major changes we've... [17:16:29] actually, i take that back. 
it did not block the next compile waiting in line as there is more than one now [17:17:09] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10927/" [puppet] - 10https://gerrit.wikimedia.org/r/425945 (owner: 10Dzahn) [17:17:23] (03PS3) 10Dzahn: installserver: convert nested roles to profiles [puppet] - 10https://gerrit.wikimedia.org/r/425945 [17:17:48] !log upgraded packages on all labvirts and restarted nova-compute [17:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:26] 10Operations, 10Ops-Access-Requests: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130003 (10Dzahn) Deployment access for Piotr has already been granted in T148477 back in October 2016 (including approval in Ops meeting). So this ticket is kind of a duplicate/re-open... [17:23:36] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130139 (10Dzahn) [17:25:27] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130143 (10pmiazga) @Dzahn yes, exactly. I wasn't sure what exactly do I need. II noticed that I'm in `deployers` group and it confused me a lot... [17:29:03] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130162 (10Dzahn) I _think_ it's that you are missing here: https://gerrit.wikimedia.org/r/#/admin/groups/21,members @greg @demon Can you appr... [17:30:18] mutante: oooohhhh that makes a lot of sense, sorry I didn't realize [17:31:16] Anyone able to merge https://gerrit.wikimedia.org/r/#/c/426112/ for me? Verified compiled change via https://puppet-compiler.wmflabs.org/compiler03/10928/. 
[17:31:54] (03PS1) 10Andrew Bogott: labpuppetmaster: force to Liberty [puppet] - 10https://gerrit.wikimedia.org/r/426119 (https://phabricator.wikimedia.org/T192162) [17:33:06] (03CR) 10Andrew Bogott: [C: 032] labpuppetmaster: force to Liberty [puppet] - 10https://gerrit.wikimedia.org/r/426119 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [17:36:31] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:37:11] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:43:43] (03CR) 10Dzahn: [C: 032] aptrepo::wikimedia: convert from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/425946 (owner: 10Dzahn) [17:43:49] (03PS2) 10Dzahn: aptrepo::wikimedia: convert from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/425946 [17:46:20] fdans: no worries:) that's why i wanted to share [17:46:33] (03CR) 10Dzahn: [C: 032] webperf and coal: add scap_target stanzas [puppet] - 10https://gerrit.wikimedia.org/r/426112 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:46:39] (03PS2) 10Dzahn: webperf and coal: add scap_target stanzas [puppet] - 10https://gerrit.wikimedia.org/r/426112 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:46:59] mutante: nice alex de la iglesia reference btw :) [17:47:51] mutante: thanks for that :-) [17:48:07] fdans: haha, you are one of the few who noticed that :) it's actually one more level of meta. it was a band named after the movie :p [17:48:20] (03PS3) 10Dzahn: webperf and coal: add scap_target stanzas [puppet] - 10https://gerrit.wikimedia.org/r/426112 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [17:48:23] daaamn [17:48:34] marlier: welcome! rebasing. are you gonna run puppet or should i? [17:49:07] I can run it on hafnium, don' [17:49:15] t have permissions to do so on graphite1001, though. 
[17:49:25] Or I can jsut wait until it runs on its own, honestly [17:50:07] better to run it right away (one 1 host is enough though) [17:50:17] i submitted it on the master, you can go ahead [17:51:03] k, I'll run on hafnium. [17:51:09] One minute [17:54:00] Puppet's logging an error because I haven't actually pulled the repo down to tin yet -- deploy-local fails. Looks like the run still completes, though, and exits with 0 [17:54:29] anyone lazy enough with mailing list super-power? [17:54:57] mutante: As long as it's not triggering any alerts, I think it's fine, and I'll get to work on getting the repo into the right place. [17:56:41] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[performance/navtiming] [17:57:06] If someone wants to merge https://gerrit.wikimedia.org/r/#/c/426104/ again, please do. This time I've cherry-picked it on beta first, so it's a noop for both beta and prod. [17:57:53] LOL and there's the error, awesome. [17:59:26] marlier: nice timing, i was about to say :) [18:08:32] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[performance/coal] [18:10:12] (03CR) 10Dzahn: [C: 031] "you may have to split this into 2 changes. one creating a new empty group, and one adding springle to the group. i think "add new user to " [puppet] - 10https://gerrit.wikimedia.org/r/425263 (https://phabricator.wikimedia.org/T191478) (owner: 10ArielGlenn) [18:11:45] marlier: i'll let you work on the repo and leave it as is. just if it can't be fixed today, let's revert instead of leaving it over the weekend [18:12:07] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4130327 (10herron) >>! In T175361#3627483, @akosiaris wrote: > Do we indeed ? Do we know if we have `Good` reputation ? 
In my (granted old) experience, reputation is usually either `Neutral` or `Bad`. `Neutral` i... [18:12:41] mutante: should be fixed in a minute [18:12:55] First time setting something up for scap so it's taking a bit. [18:13:02] Didn't know about --init in the repo [18:14:10] sounds good , no particular rush [18:14:25] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4130337 (10ayounsi) [18:14:28] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4130334 (10ayounsi) 05Open>03Resolved a:03ayounsi Renamed. Feel free to re-open that tasks for the future hosts. [18:16:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.00, 37.10, 31.74 [18:18:21] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4130360 (10ayounsi) >>! In T175361#4130327, @herron wrote: > @ayounsi would it be difficult to temporarily reject packets to for example mx2001.wikimedia.org:25/tcp with a network firewall (as a poor man's depool... [18:21:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.77, 38.08, 33.33 [18:21:41] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:21:42] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[performance/coal] [18:21:45] mutante: hafnium fixed, give me about 5 more minutes for graphite2001 (and graphite1001, though I haven't actually seen an alert come through for it yet.) [18:21:48] What timing! 
[18:25:41] (03CR) 10EddieGP: [C: 04-1] "See inline comment; other than that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [18:26:10] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130381 (10atgo) @Nuria @BBlack thanks for all the work on this! Once it's resolved, will the data for the time window that this was a... [18:26:11] marlier: :) alright [18:27:14] (03CR) 10EddieGP: [C: 031] Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 (owner: 10Chad) [18:28:13] (03CR) 10EddieGP: [C: 031] Move wiktionary and foundationwiki docroots to standard docroot [puppet] - 10https://gerrit.wikimedia.org/r/402090 (https://phabricator.wikimedia.org/T126306) (owner: 10Chad) [18:29:57] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10PHP 7.0 support, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4130397 (10Jdforrester-WMF) [18:31:00] (03PS1) 10Dzahn: icinga: temp remove Rob from paging [puppet] - 10https://gerrit.wikimedia.org/r/426132 [18:31:18] (03CR) 10jerkins-bot: [V: 04-1] icinga: temp remove Rob from paging [puppet] - 10https://gerrit.wikimedia.org/r/426132 (owner: 10Dzahn) [18:32:03] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130418 (10MaxSem) >>! In T192159#4130118, @Jdlrobson wrote: > @Niharika @MaxSem I don't suppose either of you would be able to help get him up... [18:32:31] mutante: puppet should run on graphite1001/graphite2001 at this point, but I don't have permissions to do so. 
[18:34:12] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130422 (10Nuria) @atgo, It cannot be, we no longer have the original Ips of the records that are wrongly labeled. [18:36:23] 10Operations, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Proxies information gone from Zero portal - https://phabricator.wikimedia.org/T187014#4130431 (10atgo) Ok, thanks @nuria [18:36:35] marlier: ok, on it [18:37:18] (03PS2) 10Dzahn: icinga: temp remove Rob from paging [puppet] - 10https://gerrit.wikimedia.org/r/426132 [18:37:33] marlier: ack, it runs fine. Notice: /Stage[main]/Coal/Scap::Target[performance/coal]/Package[performance/coal]/ensure: created [18:37:48] that happened only on graphite1001 [18:37:54] Awesome. Sorry about the noise [18:37:58] no worries [18:38:08] Actually, here's a question: does codfw have a different deploy host? [18:38:32] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [18:38:47] Apparently not [18:39:08] Thanks, mutante: you're the best. [18:40:58] marlier: yes, it has naos.codfw.wmnet. but no, it's not used. only either eqiad or codfw are the active deployment server at a time [18:41:12] it could be flipped over but it's only tin right now [18:41:19] Rockin' [18:41:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 45.90, 33.33, 32.50 [18:41:42] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:02:45] (03CR) 10Dzahn: [C: 032] "as requested" [puppet] - 10https://gerrit.wikimedia.org/r/426132 (owner: 10Dzahn) [19:09:11] PROBLEM - designate-api http on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:03] Can I get a root to nuke /srv/mediawiki-staging/php-1.31.0-wmf.24 on tin? 
Some stuff in .git isn't letting me delete it [19:15:08] (trying to clear the old deploy directory out) [19:16:53] 3 ... 2.... 1 [19:17:02] gone [19:17:12] no_justification: ^^ [19:17:24] tyvn [19:17:32] *tyvm [19:17:43] yw [19:21:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 50.80, 35.83, 33.56 [19:23:05] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.25 (duration: 05m 03s) [19:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:32] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.106 second response time [19:41:58] (03PS1) 10Eevans: cassandra: restore (most) G1GC settings to defaults [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) [19:43:54] (03CR) 10Eevans: [C: 04-1] "Not just yet; Not on a Friday..." [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [19:59:01] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.08, 35.59, 32.73 [20:00:19] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.28 [keeping static files] (duration: 01m 34s) [20:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:59] hi ops! 
I've got a firewall update request in for a payment processor IP change: https://phabricator.wikimedia.org/T191669 [20:02:27] haven't seen any updates since cwd did the payments-cluster-side work [20:02:40] and the payment processor deadline for the switchover is Monday :( [20:02:49] sorry to bring this up on a friday [20:09:13] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4130711 (10Gilles) [20:10:29] 10Operations, 10Performance-Team: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#3391697 (10Gilles) a:03Gilles I don't think it's flamegraph.pl's fault, the issue is with the last line of the log file, which seems to be truncated in the repro example... [20:21:54] (03PS1) 10Gilles: Make xenon-log line-buffered [puppet] - 10https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) [20:24:50] Hi operations! Anyone here who can help with a firewall change? 
[20:25:08] I meant to bring this up yesterday, sorry [20:25:13] https://phabricator.wikimedia.org/T191669 [20:25:14] https://phabricator.wikimedia.org/T191669 [20:25:18] (oops0 [20:25:19] ) [20:26:31] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.096 second response time [20:33:53] (03PS45) 10Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 [20:43:54] (03PS46) 10Aaron Schulz: Add mcrouter module and mcrouter_wancache profile and enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/392221 [20:44:08] !log imarlier@tin Started deploy [performance/navtiming@8b6ab4e]: initial attempt to deploy navtiming via scap (will not be active) [20:44:11] !log imarlier@tin Finished deploy [performance/navtiming@8b6ab4e]: initial attempt to deploy navtiming via scap (will not be active) (duration: 00m 02s) [20:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:31] RECOVERY - designate-api http on labservices1001 is OK: HTTP OK: HTTP/1.1 200 OK - 571 bytes in 0.005 second response time [20:54:39] (03PS2) 10Bstorm: Use ::exim4 consistently when applying class [puppet] - 10https://gerrit.wikimedia.org/r/421375 (owner: 10BryanDavis) [20:55:35] (03CR) 10Bstorm: [C: 032] Use ::exim4 consistently when applying class [puppet] - 10https://gerrit.wikimedia.org/r/421375 (owner: 10BryanDavis) [21:01:11] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.20, 39.06, 32.57 [21:02:48] (03PS1) 10Andrew Bogott: designate: fixup the authtoken filter in api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/426237 (https://phabricator.wikimedia.org/T192174) [21:03:53] (03PS2) 10Andrew Bogott: designate: fixup the authtoken filter in api-paste.ini [puppet] - 
10https://gerrit.wikimedia.org/r/426237 (https://phabricator.wikimedia.org/T192174) [21:08:14] (03CR) 10Andrew Bogott: [C: 032] designate: fixup the authtoken filter in api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/426237 (https://phabricator.wikimedia.org/T192174) (owner: 10Andrew Bogott) [21:30:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 43.37, 32.02, 32.38 [21:37:46] andrewbogott: any chance you can help me with a firewall change? [21:37:51] https://phabricator.wikimedia.org/T191669 [21:38:08] I'd meant to follow up with ops yesterday but I forgot [21:38:09] ejegg: "You do not have permission to view this object." [21:38:15] oh no! [21:38:15] And by 'you' I mean me [21:38:32] that explains why ops never got to it.... [21:39:29] aha, i had it set to 'fundraising' and not 'nda' [21:39:53] ejegg: I can see it now! [21:39:54] it's 'nda' now, though i suppose a couple of payment processor ip addresses are pretty public [21:40:12] really sorry to pester you about this on a friday afternoon [21:40:28] but the payment processor is turning off the old api address monday [21:41:02] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130954 (10Niharika) I'll be happy to help. :) [21:41:08] is this a simple procedure? [21:41:27] Sounds like cwd got the files set up, he just needed a key to a particular machine [21:43:47] ejegg: sorry, there's really not enough context in that task for me to understand what's happening… this is something that needs to change in the fundraising cluster? 
[21:44:23] andrewbogott: the fundraising cluster has two layers of firewall, as I understand [21:44:43] there's a change that fr-tech-ops makes, which is already done [21:45:10] then there's another that the broader ops team makes, which is not yet done [21:45:41] 10Operations, 10Ops-Access-Requests, 10Gerrit, 10Release-Engineering-Team: Requesting access to deployment for pmiazga - https://phabricator.wikimedia.org/T192159#4130960 (10Jdlrobson) Thanks both <3 [21:45:54] Ok, then this is probably outside the realm of things that I know how to do. I think you should restate the case on that bug explaining what traffic you need going from where to where (and which is the originating traffic? Because the task makes it sound like you're trying to call out from FR to an external site which is different from allowing incoming traffic)... [21:46:20] andrewbogott: yeah, exactly, the payments cluster needs to make outgoing calls [21:46:26] and then you'll need someone like arzhel or faidon to look at it I think [21:46:40] ejegg: best to put that on the task :) [21:46:45] ok, thanks for the pointer! [21:47:20] Unfortunately most of the people who would help you are in EU timezones so won't be around until early Monday or Tuesday [21:47:38] Jeff might know how to fix it though. [21:47:40] ah crap [21:47:44] ok, thanks again! [21:56:05] Ejegg, I just stepped out for a backpacking trip, earliest I can tackle it is Monday evening Pacific time [21:57:51] Thanks XioNoX! [21:58:19] I'll try to ping faidon and arzhel monday morning [21:58:57] Well, I'm Arzhel :) [22:00:05] ohhh, got it! 
[22:00:15] nice to irc-meet you [22:20:16] (03PS1) 10Andrew Bogott: designate: update policy.json to use 'zone' instead of 'domain' [puppet] - 10https://gerrit.wikimedia.org/r/426270 [22:21:06] (03CR) 10Andrew Bogott: [C: 032] designate: update policy.json to use 'zone' instead of 'domain' [puppet] - 10https://gerrit.wikimedia.org/r/426270 (owner: 10Andrew Bogott) [22:52:13] 10Puppet, 10Cloud-Services, 10Documentation: Missing documentation for labs puppet roles - https://phabricator.wikimedia.org/T91770#1095890 (10Quiddity) @awight Is this task still relevant, given the many changes over the last 3 years? Hopefully the details you were looking for are now in https://wikitech.wi... [22:57:22] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 17.03, 21.01, 23.83 [23:04:34] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Long-lived cherry-picks on deployment-puppetmaster02.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T191294#4131090 (10EddieGP) 05Open>03declined wontfix until {T135427}. We did, do and will just ignore that check, or... [23:08:34] if we would have installed phabricator on a misc server with element/star names... it would have to be one of these for codfw: Phact, Phecda, Pherkad [23:16:27] 10Puppet, 10Beta-Cluster-Infrastructure, 10cloud-services-team: labs-puppetmaster/Labs Puppetmaster HTTPS is UNKNOWN since [...] 
- https://phabricator.wikimedia.org/T191553#4131114 (10EddieGP) p:05Triage>03Low [23:16:47] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-secureredirexperiment puppet error - https://phabricator.wikimedia.org/T191663#4131115 (10EddieGP) p:05Triage>03Normal [23:17:07] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet errors - https://phabricator.wikimedia.org/T191109#4131116 (10EddieGP) p:05Triage>03Normal [23:21:16] "nunki" will be the new terbium [23:21:43] because i did not pick mothallah or muliphein [23:22:04] naming servers is serious business [23:22:51] Zubenelhakrabi, Zubeneschamali and Zubenelgenubi are reserved for future use [23:25:51] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#4131124 (10EddieGP) [23:25:59] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Cloud-Services: Implement role based hiera lookups for labs - https://phabricator.wikimedia.org/T120165#4131122 (10EddieGP) 05Open>03declined I agree with the previous comments. Horizons prefix functionality seems to cover about everything this w... [23:30:41] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 330 MB (3% inode=75%) [23:31:45] (03PS1) 10Dzahn: add mgmt DNS for nunki, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) [23:33:32] (03PS2) 10Dzahn: add mgmt DNS for nunki, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) [23:33:53] (03PS3) 10Dzahn: add mgmt DNS for nunki, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) [23:38:16] (03CR) 10Dzahn: [C: 04-2] "duuh.. use an element name! 
we were still talking eqiad here, not codfw" [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [23:42:32] 10Operations, 10hardware-requests: request to assign WMF3565 as terbium equivalent - https://phabricator.wikimedia.org/T192185#4131143 (10Dzahn) [23:51:34] (03PS4) 10Dzahn: add mgmt DNS for nihonium, new eqiad maintenance server [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092)