[02:49:49] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[02:51:29] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.79 ms
[03:15:05] (CR) Jayprakash12345: [C: 1] Enable Extension:Newsletter on hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: Zoranzoki21)
[03:33:29] PROBLEM - puppet last run on mw2169 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:50:00] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89921.90 seconds
[04:00:59] RECOVERY - puppet last run on mw2169 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[05:05:10] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:06:09] PROBLEM - Nginx local proxy to apache on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:35:01] (PS6) TerraCodes: Remove overlapping userrights [mediawiki-config] - https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983)
[07:28:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[07:28:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[07:42:40] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:42:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:04:18] (CR) Ladsgroup: labs: Use redis lock manager for dispatching changes of Wikibase (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/381529 (https://phabricator.wikimedia.org/T175109) (owner: Ladsgroup)
[10:05:53] (CR) Ladsgroup: [C: 1] "If you want me to deploy, I'm around." [mediawiki-config] - https://gerrit.wikimedia.org/r/381615 (https://phabricator.wikimedia.org/T177153) (owner: Hoo man)
[11:54:43] (PS1) Jayprakash12345: Change the zh-classicalwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/381630 (https://phabricator.wikimedia.org/T177165)
[12:08:03] (PS6) Zoranzoki21: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[12:08:12] (CR) jerkins-bot: [V: -1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[12:08:34] (CR) Zoranzoki21: [C: 1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[12:10:43] (CR) Jayprakash12345: "This is my First Logo Changing Patch. I Test the Logo on My Local machine. Everything is ok. See https://imgur.com/2ebyvgG" [mediawiki-config] - https://gerrit.wikimedia.org/r/381630 (https://phabricator.wikimedia.org/T177165) (owner: Jayprakash12345)
[12:20:07] (CR) Zoranzoki21: [C: 1] Change the zh-classicalwiki logo [mediawiki-config] - https://gerrit.wikimedia.org/r/381630 (https://phabricator.wikimedia.org/T177165) (owner: Jayprakash12345)
[12:28:14] (CR) Jayprakash12345: [C: -1] "1. Add Comment in InitialiseSettings.php about change like //T12345" [mediawiki-config] - https://gerrit.wikimedia.org/r/380563 (https://phabricator.wikimedia.org/T176008) (owner: Zoranzoki21)
[12:52:28] (Abandoned) Zoranzoki21: Change Turkish Wiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/380563 (https://phabricator.wikimedia.org/T176008) (owner: Zoranzoki21)
[13:23:41] (PS1) Hashar: contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632
[13:41:31] (PS2) Hashar: contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632
[13:41:53] (CR) jerkins-bot: [V: -1] contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632 (owner: Hashar)
[13:44:10] (PS3) Hashar: contint: move slave and website roles to profiles [puppet] - https://gerrit.wikimedia.org/r/381632
[13:46:13] (CR) Hashar: "Lets convert CI to role/profile. Patch seems good now: https://puppet-compiler.wmflabs.org/compiler02/8125/" [puppet] - https://gerrit.wikimedia.org/r/381632 (owner: Hashar)
[13:50:25] !log restart hhvm on mw1167 (jobrunner) - hhvm stuck, dump-debug in /tmp/hhvm.9624.bt.
[13:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:50] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[13:51:50] RECOVERY - Nginx local proxy to apache on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.006 second response time
[14:02:09] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:09] PROBLEM - Nginx local proxy to apache on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:12:28] (Abandoned) Jayprakash12345: Temporary IP Cap Lift on zh.wiki and commons [mediawiki-config] - https://gerrit.wikimedia.org/r/381442 (https://phabricator.wikimedia.org/T177071) (owner: Jayprakash12345)
[15:27:21] (Draft2) Jayprakash12345: Enable the Extension:SandboxLink on the nlwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/381634
[15:28:03] (PS3) Jayprakash12345: Enable the Extension:SandboxLink on the nlwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/381634 (https://phabricator.wikimedia.org/T177170)
[15:33:52] (CR) Zoranzoki21: [C: 1] Enable the Extension:SandboxLink on the nlwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/381634 (https://phabricator.wikimedia.org/T177170) (owner: Jayprakash12345)
[15:36:28] (Draft2) Zoranzoki21: Removing expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/381635
[16:10:10] PROBLEM - MegaRAID on db1056 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[16:10:12] ACKNOWLEDGEMENT - MegaRAID on db1056 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T177171
[16:10:17] Operations, ops-eqiad: Degraded RAID on db1056 - https://phabricator.wikimedia.org/T177171#3649267 (ops-monitoring-bot)
[16:34:00] PROBLEM - Apache HTTP on mw1221 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[16:35:00] RECOVERY - Apache HTTP on mw1221 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.076 second response time
[16:36:50] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[16:36:50] RECOVERY - Nginx local proxy to apache on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.008 second response time
[17:54:53] (CR) BryanDavis: Microtask for Outreachy(Round15) that describes the understanding of the webservice commands. webservice --backend kubernetes start webservi (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/380568 (owner: Sowjanyavemuri)
[18:09:25] (CR) BryanDavis: "Nice explanation. I'd like to see a couple of things changed to give you a bit of practice with amending a Gerrit patch:" (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/381296 (owner: Mridubhatnagar)
[18:21:46] (PS1) Ladsgroup: Remove deprecated config variable for Wikibase [mediawiki-config] - https://gerrit.wikimedia.org/r/381642 (https://phabricator.wikimedia.org/T129475)
[18:47:13] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649345 (Paladox)
[18:47:33] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649358 (Paladox) p:Triage>Unbreak!
[18:51:47] (PS1) Hashar: Convert zuul::merger to a profile [puppet] - https://gerrit.wikimedia.org/r/381646
[18:58:11] !log restarted Jenkins. Was stuck somehow
[18:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:06] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649345 (Legoktm) Disk space looks OK to me: ``` legoktm@contint1001:~$ df -h Filesystem Size Used Avail Use% Mounted on udev...
[19:05:54] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649378 (Paladox) i think Caused by: java.lang.OutOfMemoryError: PermGen space could be the ram.
[19:08:38] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649393 (Paladox) p:Unbreak!>High Hashar restarted it and it came back on, he also said it happened last friday too. so maybe we should investigate why...
[19:11:07] (PS1) Hashar: Convert zuul::server to a profile [puppet] - https://gerrit.wikimedia.org/r/381647
[19:13:41] (PS1) Hashar: contint: move an include from site.pp to role [puppet] - https://gerrit.wikimedia.org/r/381648
[19:14:18] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649398 (Paladox) See https://wiki.jenkins.io/display/JENKINS/Builds+failing+with+OutOfMemoryErrors#BuildsfailingwithOutOfMemoryErrors-HeaporPermgen?
[19:26:14] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: CI is down (jenkins) - https://phabricator.wikimedia.org/T177174#3649400 (Paladox) https://jenkins.io/blog/2016/11/21/gc-tuning/ java 8 may improve things.
[19:26:55] (PS1) Hashar: contint: move jenkins from role to a profile [puppet] - https://gerrit.wikimedia.org/r/381649
[19:36:47] (CR) Hashar: "https://puppet-compiler.wmflabs.org/compiler02/8127/ but it diff against tip of the branch so that does not take in account the parent ch" [puppet] - https://gerrit.wikimedia.org/r/381649 (owner: Hashar)
[19:38:32] (PS1) Jayprakash12345: Change the Turkish Wiktionary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/381650 (https://phabricator.wikimedia.org/T176008)
[19:39:40] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 640.22 seconds
[19:45:25] (CR) Zoranzoki21: [C: 1] "Excellent.. I only can +1 to put :(" [mediawiki-config] - https://gerrit.wikimedia.org/r/381650 (https://phabricator.wikimedia.org/T176008) (owner: Jayprakash12345)
[20:15:59] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 287.10 seconds
[20:18:01] (CR) Qgil: [C: -1] "Interwiki support for Newsletter extension needs to be solved before deploying this extension in other Wikimedia wikis. Please don't deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) (owner: Zoranzoki21)
[20:34:19] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:02:39] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[21:13:44] (PS1) Gerrit Patch Uploader: Test [mediawiki-config] - https://gerrit.wikimedia.org/r/381704
[21:13:45] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/381704 (owner: Gerrit Patch Uploader)
[21:17:11] (CR) Zoranzoki21: [C: -1] "I can not to abandon this" [mediawiki-config] - https://gerrit.wikimedia.org/r/381704 (owner: Gerrit Patch Uploader)
[21:26:31] (PS1) Zoranzoki21: Test of made patch using windows command-line [mediawiki-config] - https://gerrit.wikimedia.org/r/381705
[21:26:42] (CR) jerkins-bot: [V: -1] Test of made patch using windows command-line [mediawiki-config] - https://gerrit.wikimedia.org/r/381705 (owner: Zoranzoki21)
[21:26:55] (Abandoned) Zoranzoki21: Test of made patch using windows command-line [mediawiki-config] - https://gerrit.wikimedia.org/r/381705 (owner: Zoranzoki21)
[21:26:56] (Abandoned) BryanDavis: Test [mediawiki-config] - https://gerrit.wikimedia.org/r/381704 (owner: Gerrit Patch Uploader)
[21:30:52] !log powercycling labvirt1015
[21:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:58] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649499 (bd808) Resolved>Open @andrew moved 9 VMs to this host on 2017-09-29. On 2017-10-01 we found it non-responsive to ssh and with this output on the management...
[21:36:52] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649502 (bd808) Console logging on boot: ``` labvirt1015 login: [ 48.163451] kvm [3714]: vcpu0 unhandled rdmsr: 0x611 [ 48.169003] kvm [3714]: vcpu0 unhandled rdmsr: 0x...
[21:41:00] (PS7) Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741)
[21:48:08] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649518 (chasemp) Note to self: fix cold-migrate to handle already shut down instances
[22:05:12] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649543 (chasemp) Entirety of labvirt1015 console during crash https://usercontent.irccloud-cdn.com/file/mwxQTBO0/Screen%20Shot%202017-10-01%20at%202.26.59%20PM.png
[22:05:39] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649544 (Andrew) The last syslog before reboot was at Oct 1 01:21:01. It was down for many hours and didn't page because I downtimed it during the hardware replacement an...
[22:19:22] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649602 (Andrew) Here's the latest mcelog. Without timestamps it's hard to correlate this to the failures but still seems bad. {F9946281}
[22:21:21] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (Paladox) >>! In T171473#3649602, @Andrew wrote: > Here's the latest mcelog. Without timestamps it's hard to correlate this to the failures but still seems bad. >...