[01:06:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 35 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[01:11:56] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[01:12:36] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:26] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 72939 bytes in 0.275 second response time
[03:26:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 725.31 seconds
[03:32:55] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:33:45] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:55:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 137.19 seconds
[04:00:26] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:01:16] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[05:19:25] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:29:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 9 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:36:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 50 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:01:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:58:36] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 56 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:08:36] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:14:16] PROBLEM - Check HHVM threads for leakage on mw1168 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:21:05] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:44:25] RECOVERY - Check HHVM threads for leakage on mw1168 is OK: OK
[07:55:21] (PS1) Odder: Add new logo for the Baskhir Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471)
[08:35:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:40:55] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:41:05] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
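
The flapping "IPv4 ping to codfw" alerts above come from RIPE Atlas measurement 1791210: the check counts how many of the 278 assigned probes got no ping reply in the latest round and goes CRITICAL once more than 19 fail. A minimal sketch of that logic in Python, assuming the public RIPE Atlas latest-results API and its "rcvd" (replies received) field; this is an illustration, not the production check_ripe_atlas code:

    # Sketch: count failed probes in a RIPE Atlas ping measurement.
    # Endpoint and the "rcvd" field name are assumptions, not taken
    # from the actual Wikimedia check.
    import requests

    MEASUREMENT_ID = 1791210   # the "IPv4 ping to codfw" measurement above
    ALERT_THRESHOLD = 19       # matches "(alerts on 19)"

    def failed_probe_count(measurement_id):
        # Latest result per probe; a probe counts as failed when it
        # received zero ping replies.
        url = "https://atlas.ripe.net/api/v2/measurements/%d/latest/" % measurement_id
        results = requests.get(url, timeout=30).json()
        failed = sum(1 for r in results if r.get("rcvd", 0) == 0)
        return failed, len(results)

    failed, total = failed_probe_count(MEASUREMENT_ID)
    status = "CRITICAL" if failed > ALERT_THRESHOLD else "OK"
    print("%s - failed %d probes of %d (alerts on %d)" % (status, failed, total, ALERT_THRESHOLD))
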
[08:57:55] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 51 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[09:31:17] Operations, Wikimedia-Site-requests, Regression, User-Ladsgroup, and 2 others: Unblock stuck global renames at Meta-Wiki - https://phabricator.wikimedia.org/T173419#3535957 (MarcoAurelio) Open>Resolved a: Ladsgroup Closed as everything seems back to normal. Thank you for your help.
[10:28:06] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[10:35:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[10:50:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:12:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 45 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:37:15] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[11:56:40] (PS1) Gerrit Patch Uploader: Set X-Frame-Options: SAMEORIGIN if UploadWizard enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631)
[11:56:42] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631) (owner: Gerrit Patch Uploader)
[12:14:25] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[12:31:46] PROBLEM - Check size of conntrack table on ms-fe1005 is CRITICAL: CRITICAL: nf_conntrack is 100 % full
[12:32:47] RECOVERY - Check size of conntrack table on ms-fe1005 is OK: OK: nf_conntrack is 75 % full
[12:34:25] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 14 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[13:00:14] (PS2) Urbanecm: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471)
[13:01:52] (PS3) Urbanecm: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471)
[13:02:38] (PS2) Urbanecm: Add new logo for the Baskhir Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471) (owner: Odder)
[13:03:32] (CR) Urbanecm: [C: 1] "Just removed the logo part from my patch (372212) and added correct line to this one. Note that they do not depend on each other as the wi" [mediawiki-config] - https://gerrit.wikimedia.org/r/372777 (https://phabricator.wikimedia.org/T173471) (owner: Odder)
[13:06:09] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/372789 (https://phabricator.wikimedia.org/T173631) (owner: Gerrit Patch Uploader)
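
The "Check size of conntrack table" alert on ms-fe1005 above is a fullness check on the kernel connection-tracking table: usage is current entries divided by the table maximum. A rough sketch of that calculation, reading the standard /proc counters; the warning and critical thresholds here are assumed for illustration, not the production values:

    # Sketch of the nf_conntrack fullness calculation behind the
    # ms-fe1005 alert. Thresholds are assumed, not the real check's.
    def conntrack_usage_percent():
        # Current entries vs. table size from the kernel counters.
        with open("/proc/sys/net/netfilter/nf_conntrack_count") as f:
            count = int(f.read())
        with open("/proc/sys/net/netfilter/nf_conntrack_max") as f:
            maximum = int(f.read())
        return 100.0 * count / maximum

    pct = conntrack_usage_percent()
    if pct >= 90:      # assumed critical threshold
        print("CRITICAL: nf_conntrack is %.0f %% full" % pct)
    elif pct >= 80:    # assumed warning threshold
        print("WARNING: nf_conntrack is %.0f %% full" % pct)
    else:
        print("OK: nf_conntrack is %.0f %% full" % pct)
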
[13:10:46] PROBLEM - puppet last run on labtestservices2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:34:01] (PS1) Urbanecm: Update logos for srwiktionary, add HD logos for srwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372795 (https://phabricator.wikimedia.org/T172245)
[13:39:15] RECOVERY - puppet last run on labtestservices2003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[13:44:11] (PS1) Urbanecm: Add HD logos for srwikisource, update them too [mediawiki-config] - https://gerrit.wikimedia.org/r/372796 (https://phabricator.wikimedia.org/T172268)
[13:48:28] (PS1) Urbanecm: Update logos for srwikinews, add HD version for them [mediawiki-config] - https://gerrit.wikimedia.org/r/372797 (https://phabricator.wikimedia.org/T172255)
[13:49:06] (PS4) Urbanecm: Reopen bawikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/372212 (https://phabricator.wikimedia.org/T173471)
[13:49:18] (PS2) Urbanecm: Enable SandboxLink on cywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054)
[14:18:26] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[14:18:32] (PS1) Urbanecm: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643)
[14:20:08] (CR) jerkins-bot: [V: -1] Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643) (owner: Urbanecm)
[14:38:19] (CR) Zoranzoki21: [C: 1] "Looks good to me, but someone else must approve." [mediawiki-config] - https://gerrit.wikimedia.org/r/372795 (https://phabricator.wikimedia.org/T172245) (owner: Urbanecm)
[14:41:24] (PS2) Urbanecm: Initial configuration for hifwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/372798 (https://phabricator.wikimedia.org/T173643)
[14:48:26] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[17:24:06] Operations, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3536348 (madhuvishy) @Papaul Aah, sorry, I had pinged you on the task and didn't know about adding to the ops-codfw board, I'll definit...
[17:24:18] Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3536349 (madhuvishy)
[17:25:55] PROBLEM - HHVM rendering on mw1287 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:26:55] RECOVERY - HHVM rendering on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 73718 bytes in 0.387 second response time
[17:28:25] PROBLEM - Apache HTTP on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:28:36] PROBLEM - HHVM rendering on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:28:45] PROBLEM - Nginx local proxy to apache on mw1284 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[17:29:25] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.044 second response time
[17:29:45] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 73718 bytes in 0.642 second response time
[17:29:45] RECOVERY - Nginx local proxy to apache on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.265 second response time
[17:46:16] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2005582
[19:18:11] (CR) Ladsgroup: "No, redis_host is the same but gets overwritten" [puppet] - https://gerrit.wikimedia.org/r/369915 (https://phabricator.wikimedia.org/T169246) (owner: Halfak)
[23:26:26] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 127898
[23:39:35] PROBLEM - Disk space on logstash1006 is CRITICAL: DISK CRITICAL - /var/lib/elasticsearch is not accessible: Input/output error
[23:39:55] PROBLEM - MD RAID on logstash1006 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0
[23:39:56] ACKNOWLEDGEMENT - MD RAID on logstash1006 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T173679
[23:39:59] Operations, ops-eqiad: Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3536569 (ops-monitoring-bot)
[23:57:45] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/elasticsearch]
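
The "MD RAID on logstash1006" alert above reports a software-RAID array with one failed member (State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0), which the RAID handler then auto-acknowledged and filed as T173679. A rough sketch of how such a summary can be produced by parsing mdadm --detail; the parsing and the device name are assumptions for illustration, not the actual Wikimedia RAID check or handler:

    # Sketch: summarise one md array's health from `mdadm --detail`.
    # Output-format parsing and /dev/md0 are assumptions.
    import re
    import subprocess

    def md_raid_status(device="/dev/md0"):
        out = subprocess.run(["mdadm", "--detail", device],
                             capture_output=True, text=True, check=True).stdout
        # Lines look like "   Active Devices : 5"; build a key/value map.
        fields = dict(re.findall(r"^\s*([A-Za-z ]+?) : (.+)$", out, re.MULTILINE))
        state = fields.get("State", "unknown").strip()
        counts = {k: int(fields.get("%s Devices" % k, 0))
                  for k in ("Active", "Working", "Failed", "Spare")}
        level = "CRITICAL" if "degraded" in state or counts["Failed"] else "OK"
        summary = ", ".join("%s: %d" % (k, v) for k, v in counts.items())
        return "%s: State: %s, %s" % (level, state, summary)

    print(md_raid_status())
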