[00:02:37] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - HHVM rendering on mw2257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - HHVM rendering on mw2276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - PHP7 rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - PHP7 rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - PHP7 rendering on mw2284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - HHVM rendering on mw2176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:39] PROBLEM - HHVM rendering on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:40] PROBLEM - HHVM rendering on mw2238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:40] PROBLEM - HHVM rendering on mw2269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:41] PROBLEM - HHVM rendering on mw2168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:41] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:42] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:35] hmm
[00:03:36] is that expected?
[00:03:47] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.282 second response time
[00:03:47] RECOVERY - HHVM rendering on mw2276 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.286 second response time
[00:03:47] RECOVERY - HHVM rendering on mw2257 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.289 second response time
[00:03:47] RECOVERY - PHP7 rendering on mw2262 is OK: HTTP OK: HTTP/1.1 200 OK - 80004 bytes in 0.294 second response time
[00:03:47] RECOVERY - PHP7 rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 80004 bytes in 0.299 second response time
[00:03:47] RECOVERY - PHP7 rendering on mw2284 is OK: HTTP OK: HTTP/1.1 200 OK - 80004 bytes in 0.316 second response time
[00:03:49] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.288 second response time
[00:03:49] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.291 second response time
[00:03:49] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.288 second response time
[00:03:49] RECOVERY - HHVM rendering on mw2238 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.291 second response time
[00:03:50] RECOVERY - HHVM rendering on mw2168 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.290 second response time
[00:03:50] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.290 second response time
[00:03:51] RECOVERY - HHVM rendering on mw2269 is OK: HTTP OK: HTTP/1.1 200 OK - 79963 bytes in 0.354 second response time
[00:03:58] this fixes it :P
[00:04:57] lol
[00:09:47] (PS3) Elukey: Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453
[00:11:10] (CR) GTirloni: [C: +2] Fix ports for wmcs/labs' Prometheus Memcached exporters [puppet] - https://gerrit.wikimedia.org/r/487453 (owner: Elukey)
[00:11:17] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14501/" [puppet] - https://gerrit.wikimedia.org/r/487453 (owner: Elukey)
[00:29:49] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:40:32] (PS1) Elukey: Fix wmcs' prometheus memcached exporter args [puppet] - https://gerrit.wikimedia.org/r/487456
[00:40:55] gtirloni: --^
[00:43:06] ok
[00:48:37] (CR) GTirloni: [C: +2] Fix wmcs' prometheus memcached exporter args [puppet] - https://gerrit.wikimedia.org/r/487456 (owner: Elukey)
[00:50:41] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[01:51:51] (PS1) 20after4: Disallow local_infile for phabricator [puppet] - https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248)
[01:53:17] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:55:53] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[02:03:32] (CR) Paladox: [C: +1] Disallow local_infile for phabricator [puppet] - https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248) (owner: 20after4)
[02:56:16] (CR) Bstorm: "If nothing else, thank you for adding the .py extension in puppet! That's a great idea. I have mixed feelings about adding wmcs on *ever" [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[02:58:23] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:27:09] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[04:04:37] Operations: client-cluster.js - "no such file or directory, open '/srv/visualdiff/testreduce/testrun.ids" - https://phabricator.wikimedia.org/T215049 (GTirloni)
[04:06:01] Operations: parsoid-vd - "no such file or directory, open '/srv/visualdiff/testreduce/testrun.ids" - https://phabricator.wikimedia.org/T215049 (GTirloni)
[04:07:06] ^^ T215049
[04:07:07] T215049: parsoid-vd - "no such file or directory, open '/srv/visualdiff/testreduce/testrun.ids" - https://phabricator.wikimedia.org/T215049
[04:07:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:20:09] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:59:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:09:13] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:31] PROBLEM - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[05:18:34] ACKNOWLEDGEMENT - MegaRAID on db1073 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T215050
[05:18:39] Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (ops-monitoring-bot)
[05:29:56] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:38:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:04] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[06:07:22] Operations, ops-eqiad: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (Marostegui) p:Triage→Normal a:Cmjohnson Let's get it replaced sooner than later as it is a master on m5
[06:09:13] Operations, ops-eqiad, DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T215050 (Marostegui)
[06:11:57] (Abandoned) MaxSem: WIP: [labs] Puppetize XTools [puppet] - https://gerrit.wikimedia.org/r/368101 (https://phabricator.wikimedia.org/T170514) (owner: MaxSem)
[06:15:04] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:36] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:32:20] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/sites-available/00-dummy.conf]
[06:56:02] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:58:44] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:13:34] !log reset 2FA on wikitech for [[User:Cicalese]]
[07:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:14] Operations, Gerrit, Icinga, monitoring: Investigate why icinga did not report high cpu/load for gerrit - https://phabricator.wikimedia.org/T215033 (Dzahn) The `check_load` plugin can be used for that. We do use it but only on other servers, API appservers, SWIFT and a passive check for Fundraisi...
[07:39:07] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Mathew.onipe) Also to further confirm that kartotherian has a discovery entry: ` onimisionipe@elastic1017:~$ ping kartotherian.discovery.wmnet PING kartotherian.discovery.wmnet (10.2.1.1...
[09:59:08] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:26:30] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[10:59:24] PROBLEM - Memory correctable errors -EDAC- on db1068 is CRITICAL: 19 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1068&var-datasource=eqiad+prometheus/ops
[11:08:31] (PS1) Arturo Borrero Gonzalez: graphite: refactor into role/profile [puppet] - https://gerrit.wikimedia.org/r/487481
[11:09:00] (PS1) Arturo Borrero Gonzalez: wmcs: monitoring: refactor code into roles/profiles [puppet] - https://gerrit.wikimedia.org/r/487482
[11:10:02] (CR) jerkins-bot: [V: -1] wmcs: monitoring: refactor code into roles/profiles [puppet] - https://gerrit.wikimedia.org/r/487482 (owner: Arturo Borrero Gonzalez)
[11:11:48] (PS2) Arturo Borrero Gonzalez: wmcs: monitoring: refactor code into roles/profiles [puppet] - https://gerrit.wikimedia.org/r/487482
[11:12:13] (PS2) Arturo Borrero Gonzalez: graphite: refactor into role/profile [puppet] - https://gerrit.wikimedia.org/r/487481
[11:14:37] (CR) Arturo Borrero Gonzalez: "I made this commit while working on" [puppet] - https://gerrit.wikimedia.org/r/487481 (owner: Arturo Borrero Gonzalez)
[11:15:37] (CR) Arturo Borrero Gonzalez: "The change with ID I7f6781aa17ed8924c13e91c83b798bdc59bb9c3c is requried by this patch." [puppet] - https://gerrit.wikimedia.org/r/487482 (owner: Arturo Borrero Gonzalez)
[11:17:52] (CR) Arturo Borrero Gonzalez: "Thank you very much folks for your review :-)" [puppet] - https://gerrit.wikimedia.org/r/487368 (owner: Arturo Borrero Gonzalez)
[12:34:31] (PS1) Arturo Borrero Gonzalez: openstack: cold-migrate: make it datacenter-aware [puppet] - https://gerrit.wikimedia.org/r/487487
[12:38:48] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: cold-migrate: make it datacenter-aware [puppet] - https://gerrit.wikimedia.org/r/487487 (owner: Arturo Borrero Gonzalez)
[13:28:43] (PS1) Arturo Borrero Gonzalez: openstack: cold-migrate: make nova database configurable [puppet] - https://gerrit.wikimedia.org/r/487491
[13:29:52] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:33:24] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T215066 (alaa_wmde)
[13:33:38] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: cold-migrate: make nova database configurable [puppet] - https://gerrit.wikimedia.org/r/487491 (owner: Arturo Borrero Gonzalez)
[13:36:40] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T215066 (alaa_wmde) a:alaa_wmde→None
[13:50:35] (PS1) Arturo Borrero Gonzalez: openstack: cold-migrate: allow to migrate VM instances in SHUTOFF state [puppet] - https://gerrit.wikimedia.org/r/487493
[13:54:50] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: cold-migrate: allow to migrate VM instances in SHUTOFF state [puppet] - https://gerrit.wikimedia.org/r/487493 (owner: Arturo Borrero Gonzalez)
[13:55:52] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[14:16:49] (PS1) Arturo Borrero Gonzalez: openstack: cold-migrate: use python logging [puppet] - https://gerrit.wikimedia.org/r/487496
[14:17:38] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: cold-migrate: use python logging [puppet] - https://gerrit.wikimedia.org/r/487496 (owner: Arturo Borrero Gonzalez)
[14:20:58] PROBLEM - puppet last run on wtp1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:52:42] RECOVERY - puppet last run on wtp1034 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[15:42:35] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Vpab15)
[15:55:53] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Krenair) Well you wouldn't be able to distinguish e.g. English Wikipedia from French Wikipedia traffic by looking at the DNS lookup or...
[15:59:38] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:01:42] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Vpab15) Open→Resolved a:Vpab15 Thanks Krenair I will mark this as resolved then
[16:06:32] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Krenair) I'm not sure it's strictly resolved, I wouldn't say it's invalid and I don't think it would get outright declined either. I f...
[16:08:12] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:25:50] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[16:29:42] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (Legoktm)
[16:35:08] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to add alaasarhan to the wmde LDAP group - https://phabricator.wikimedia.org/T215066 (WMDE-leszek) As an Engineering Manager at WMDE, I endorse this request.
[16:37:30] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Vpab15) Resolved→Open I misunderstood then. I took a look at the ESNI task you mentioned, but couldn't really understand if im...
[16:53:05] Operations, DNS, Domains, Traffic, HTTPS: Request to merge wikipedia subdomains into one to discourage censorship - https://phabricator.wikimedia.org/T215071 (Vpab15) a:Vpab15→None
[17:29:32] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:56:46] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[18:29:44] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:59:18] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[19:06:07] Operations, ops-eqiad, Cloud-Services, cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (Cmjohnson)
[19:06:34] (PS2) 20after4: Disallow local_infile for phabricator [puppet] - https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248)
[19:08:06] (CR) 20after4: [C: +1] Disallow local_infile for phabricator [puppet] - https://gerrit.wikimedia.org/r/487459 (https://phabricator.wikimedia.org/T214248) (owner: 20after4)
[19:26:42] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[19:38:40] Operations, ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (Cmjohnson) @GTirloni I do not have room in row A. These can go into Row D racks D2 and D7. Doing this will require a DNS (ip) change and I will have to fix the servers to use the 10G NIC. A re-in...
[19:47:05] Operations, ops-eqiad: cloudstore100{8,9} - Upgrade to 10GbE - https://phabricator.wikimedia.org/T214079 (GTirloni) @Cmjohnson that works for me. We can do both if time allows.
[19:51:34] Operations, ops-eqiad, Cloud-Services, cloud-services-team: cloudcontrol1004 mgmt HTTPS SSL error - https://phabricator.wikimedia.org/T215075 (GTirloni) cloudcontrol1004 is currently our standby OpenStack control server so it can be shutdown if needed. The proposed time doesn't conflict with any...
[20:25:41] (CR) Andrew Bogott: [C: +1] "Looks right to me, for v4." [dns] - https://gerrit.wikimedia.org/r/486504 (https://phabricator.wikimedia.org/T214448) (owner: Papaul)
[20:29:52] PROBLEM - Host mw1299 is DOWN: PING CRITICAL - Packet loss = 100%
[20:51:12] (CR) Gehel: [C: +1] icinga: enable check for psi and omega clusters [puppet] - https://gerrit.wikimedia.org/r/484679 (https://phabricator.wikimedia.org/T212850) (owner: Mathew.onipe)
[20:53:02] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:29:20] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:56:40] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[22:05:41] Operations, DNS, Domains, Traffic, HTTPS: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Aklapper)
[22:34:27] I will try to reboot mw1299 from the mgmt iface
[22:37:50] RECOVERY - Host mw1299 is UP: PING WARNING - Packet loss = 44%, RTA = 0.25 ms
[23:16:59] !log restart pdfrender on scb1004
[23:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:52] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time
[23:18:47] Operations, DBA, Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (Marostegui) p:Triage→Normal
[23:59:35] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.