[00:00:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:03:31] Is there a current issue with enwiki images displaying as a '?'
[00:04:05] Tell them to look at the developer console
[00:05:17] Reedy: they are not tech savvy
[00:05:21] Like at all
[00:05:37] Specs are safari on imax
[00:05:41] Mac*
[00:07:57] Well, if they can't help debug, we can't help fix it
[00:08:03] Not much we can do without proper info
[00:11:59] Nvm
[00:19:07] opening the developer console and looking for big red error messages isn't exactly hard, even if you need to guide them on how
[01:22:20] PROBLEM - MegaRAID on db1050 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[01:22:22] ACKNOWLEDGEMENT - MegaRAID on db1050 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T149509
[01:22:25] Operations, ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T149509#2754729 (ops-monitoring-bot)
[01:34:50] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1804.055098 Seconds
[01:36:00] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 42.910314 Seconds
[01:43:29] Anyone have any idea why my bot will run on bastion but not on the grid even with trusty??
I've asked in labs but haven't really got a response
[01:46:36] you will need to ask in labs
[02:27:37] !log l10nupdate@tin scap sync-l10n completed (1.28.0-wmf.23) (duration: 09m 01s)
[02:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 30 02:32:14 UTC 2016 (duration 4m 38s)
[02:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:08] (PS1) Dzahn: admin: add niedzielski to contint-admins [puppet] - https://gerrit.wikimedia.org/r/318775 (https://phabricator.wikimedia.org/T149233)
[03:25:53] PROBLEM - thumbor@8812 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8812 is inactive
[03:33:43] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:38:23] RECOVERY - thumbor@8812 service on thumbor1001 is OK: OK - thumbor@8812 is active
[03:59:33] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[04:00:53] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:52] Operations: setup YubiHSM and laptop at office - https://phabricator.wikimedia.org/T123818#2754816 (Dzahn) brought the laptop back to office in SF, on Thursday after metrics meeting, handed over to OIT
[04:27:40] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[05:41:20] (PS8) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[05:42:26] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:44:49] (PS9) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[05:45:54] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:48:37] (PS10) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[05:49:47] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:54:01] boo
[05:54:14] (PS11) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[05:55:26] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:55:39] (CR) jenkins-bot: [V: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: 
Madhuvishy)
[05:56:30] (CR) jenkins-bot: [C: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:57:00] (CR) jenkins-bot: [C: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[05:57:29] yuvipanda: ha ha how many times :D not going to merge it yet
[05:57:36] or until tomorrow
[05:57:55] or until jenkins agrees
[05:58:00] which may be never
[05:58:26] i cannot see your -2 on the patch
[05:58:31] i think i can't see at least
[05:59:47] madhuvishy: :P
[05:59:56] (CR) jenkins-bot: [C: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:00:05] (CR) jenkins-bot: [A: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:00:10] (CR) jenkins-bot: [B: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:00:14] (CR) jenkins-bot: [C: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:00:17] :(
[06:00:51] wtf jenkins
[06:01:27] (CR) jenkins-bot: [D: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:01:31] (CR) jenkins-bot: [E: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:01:58] (PS12) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[06:02:04] (CR) jenkins-bot: [H: -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:02:18] shhh 
yuvipanda you'll make all this skunkworks well knownnn
[06:02:54] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:03:43] * madhuvishy dies aligning arrows
[06:04:12] (CR) jenkins-bot => [E => -2] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner => Madhuvishy)
[06:13:55] (PS13) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[06:15:01] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[06:15:18] yuvipanda: poda
[06:15:27] * yuvipanda pos
[07:04:33] (PS14) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[07:06:58] (CR) jenkins-bot: [V: -1] jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086 (owner: Madhuvishy)
[07:07:15] yuvipanda: thanks for restoring balance
[07:09:07] madhuvishy: yw, glad to lend a helping hand
[07:09:29] !log L
[07:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:47] PROBLEM - thumbor@8832 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8832 is inactive
[08:07:47] RECOVERY - thumbor@8832 service on thumbor1001 is OK: OK - thumbor@8832 is active
[08:31:38] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[09:00:27] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[09:13:47] (CR) Hashar: [C: 1] admin: add niedzielski to contint-admins [puppet] - https://gerrit.wikimedia.org/r/318775 (https://phabricator.wikimedia.org/T149233) (owner: Dzahn)
[09:35:37] PROBLEM - thumbor@8833 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8833 is inactive
[09:45:37] RECOVERY - thumbor@8833 service on thumbor1002 is OK: OK - thumbor@8833 is active
[10:42:58] PROBLEM - thumbor@8822 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8822 is inactive
[10:45:17] RECOVERY - thumbor@8822 service on thumbor1002 is OK: OK - thumbor@8822 is active
[10:54:48] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:07:58] PROBLEM - Host cp2010 is DOWN: PING CRITICAL - Packet loss = 100%
[11:09:17] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:09:27] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:10:07] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:10:17] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:07] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[11:12:07] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy
[11:12:27] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[11:12:27] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[11:13:37] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:13:37] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:13:37] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:13:47] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:13:57] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:13:57] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:14:07] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:14:17] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:14:18] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:14:27] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:14:27] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 43 not-conn: cp2010_v6
[11:14:57] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:14:57] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp2010_v4, cp2010_v6
[11:22:07] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[11:23:17] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 
on 10.192.48.44:6479 has 1 databases (db0) with 3077908 keys, up 221 days 3 hours - replication_delay is 0
[11:24:17] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[12:52:08] PROBLEM - thumbor@8833 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8833 is inactive
[13:07:18] RECOVERY - thumbor@8833 service on thumbor1001 is OK: OK - thumbor@8833 is active
[13:20:28] PROBLEM - HHVM rendering on mw1231 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.035 second response time
[13:21:29] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 71777 bytes in 0.799 second response time
[13:37:58] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:49:28] PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time
[13:50:28] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 71777 bytes in 0.171 second response time
[13:53:25] (PS3) Gehel: cirrus - disable the rebuild of completion indices [puppet] - https://gerrit.wikimedia.org/r/318267
[13:54:08] !log disabling completion suggester crons to leave place for terbium reboot
[13:54:08] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[13:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:34] (CR) Gehel: [C: 2] cirrus - disable the rebuild of completion indices [puppet] - https://gerrit.wikimedia.org/r/318267 (owner: Gehel)
[14:09:28] PROBLEM - thumbor@8831 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8831 is inactive
[14:13:58] RECOVERY - thumbor@8831 service on thumbor1002 is OK: OK - thumbor@8831 is active
[14:18:39] PROBLEM - HHVM 
rendering on mw1206 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.008 second response time
[14:19:48] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 71777 bytes in 0.252 second response time
[14:32:28] PROBLEM - parsoid on wtp1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:33:28] RECOVERY - parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.229 second response time
[14:47:30] (PS3) Andrew Bogott: labs dns: Add mariadb::service and changes for new package [puppet] - https://gerrit.wikimedia.org/r/316598 (owner: Jcrespo)
[14:59:28] (CR) Andrew Bogott: "This seems like a good idea! I'm noticing that the service file this installs is radically different from the one I currently have... tha" [puppet] - https://gerrit.wikimedia.org/r/316598 (owner: Jcrespo)
[15:15:58] PROBLEM - Host es2019 is DOWN: PING CRITICAL - Packet loss = 100%
[15:17:18] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[15:19:29] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[15:54:48] PROBLEM - thumbor@8830 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8830 is inactive
[16:07:48] RECOVERY - thumbor@8830 service on thumbor1001 is OK: OK - thumbor@8830 is active
[16:08:28] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:16:48] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:29:01] (CR) Jcrespo: "This mysqld_safe and init files contains security fixes that the standad ones does not.
As we cannot guarantee frequent mysql upgrades, we" [puppet] - https://gerrit.wikimedia.org/r/316598 (owner: Jcrespo)
[16:29:56] es2019 most likely crashed due to hardware again
[16:32:56] Operations, ops-codfw: es2019 crashed again - https://phabricator.wikimedia.org/T149526#2755141 (jcrespo)
[16:35:25] !log powercycle es2019 after crash T149526
[16:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:31] T149526: es2019 crashed again - https://phabricator.wikimedia.org/T149526
[16:36:48] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[16:40:41] (CR) Andrew Bogott: [C: 2] "ok!" [puppet] - https://gerrit.wikimedia.org/r/316598 (owner: Jcrespo)
[16:42:38] PROBLEM - HHVM rendering on mw1282 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time
[16:43:48] RECOVERY - HHVM rendering on mw1282 is OK: HTTP OK: HTTP/1.1 200 OK - 71765 bytes in 0.385 second response time
[16:43:48] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[16:43:57] (PS1) Andrew Bogott: Fix typo in mariadb::service include [puppet] - https://gerrit.wikimedia.org/r/318791
[16:44:09] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[16:46:14] (CR) Andrew Bogott: [C: 2] Fix typo in mariadb::service include [puppet] - https://gerrit.wikimedia.org/r/318791 (owner: Andrew Bogott)
[16:46:48] Did we release wmf 23 to public for download as the current ver
[16:47:23] Zppix wmf 23 is wikimedia specific, but anyone can download it
[16:47:32] by downloading it from the wmf 23 branch
[16:47:34] Ah
[16:47:49] Didn't know that lol
[16:47:57] 1.28 will be the current version when released
[16:48:03] Ah i see
[16:48:06] currently it is 1.27.1 which is an lts release
[16:48:11] yep
[16:48:38] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[16:50:10] (PS15) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[17:35:48] PROBLEM - thumbor@8810 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8810 is inactive
[17:35:48] PROBLEM - thumbor@8814 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8814 is inactive
[17:37:58] RECOVERY - thumbor@8814 service on thumbor1001 is OK: OK - thumbor@8814 is active
[17:38:08] RECOVERY - thumbor@8810 service on thumbor1001 is OK: OK - thumbor@8810 is active
[17:38:28] PROBLEM - thumbor@8816 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8816 is inactive
[17:39:48] PROBLEM - thumbor@8814 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8814 is inactive
[17:40:50] Ok icinga-wm i think we got it
[17:45:08] RECOVERY - thumbor@8816 service on thumbor1002 is OK: OK - thumbor@8816 is active
[17:45:08] 
RECOVERY - thumbor@8814 service on thumbor1002 is OK: OK - thumbor@8814 is active
[17:53:39] What's up w/ grrrit-wm paladox
[17:53:50] Zppix i am trying some changes
[17:54:18] Like when it quits it should always try to use the nick grrrit-wm instead of grrrit-wm+(number)
[17:54:34] It now uses ssl connection
[17:54:35] too
[17:55:24] Hmm maybe should use another instance of bot instead of the main one to prevent devs from missing gerrit msgs?
[17:55:42] Oh, i can't, there is no other bot i can test with.
[17:56:03] I meant of grrrit-wm instance
[17:56:39] Yeh, there is no other bot on there.
[17:56:48] I see
[17:56:59] Zppix could you create a task for that, creating a test bot on that instance please?
[17:57:09] Yes!
[17:57:41] PROBLEM - thumbor@8813 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8813 is inactive
[17:58:11] Thank you :)
[17:59:34] Done
[17:59:41] I subscribed you
[17:59:54] Thank you :)
[18:00:32] Anytime
[18:07:18] RECOVERY - thumbor@8813 service on thumbor1001 is OK: OK - thumbor@8813 is active
[19:34:18] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[20:01:18] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[20:27:08] PROBLEM - thumbor@8812 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8812 is inactive
[20:30:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:32:18] PROBLEM - HHVM rendering on mw1192 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.004 second response time
[20:33:19] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 71635 bytes in 0.168 second response time
[20:34:38] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exportd is active
[20:35:09] PROBLEM - thumbor@8823 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8823 is inactive
[20:36:48] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:37:38] RECOVERY - thumbor@8812 service on thumbor1001 is OK: OK - thumbor@8812 is active
[20:44:38] RECOVERY - thumbor@8823 service on thumbor1002 is OK: OK - thumbor@8823 is active
[20:50:58] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:54:38] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[21:01:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:03:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[21:15:08] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[21:15:09] !log just took a huge shit
[21:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:58] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[21:22:49] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[21:40:18] PROBLEM - thumbor@8822 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8822 is inactive
[21:45:38] RECOVERY - thumbor@8822 service on thumbor1002 is OK: OK - thumbor@8822 is active
[21:57:18] PROBLEM - thumbor@8820 service on thumbor1001 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8820 is inactive
[21:59:18] PROBLEM - HHVM rendering on mw1191 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.024 second response time
[21:59:27] (PS1) Andrew Bogott: Add $managed flag to mariadb::service [puppet/mariadb] - https://gerrit.wikimedia.org/r/318859
[21:59:53] (PS2) Andrew Bogott: Labs dns: Ensure the mysql server starts at boot [puppet] - https://gerrit.wikimedia.org/r/318572
[22:00:18] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 72762 bytes in 0.156 second response time
[22:00:56] (CR) jenkins-bot: [V: -1] Labs dns: Ensure the mysql server starts at boot [puppet] - https://gerrit.wikimedia.org/r/318572 (owner: Andrew Bogott)
[22:02:48] (PS3) Andrew Bogott: Labs dns: Ensure the mysql server starts at boot 
[puppet] - https://gerrit.wikimedia.org/r/318572
[22:08:09] RECOVERY - thumbor@8820 service on thumbor1001 is OK: OK - thumbor@8820 is active
[22:17:09] !log just took a shit on platonides chest
[22:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:28] PROBLEM - Apache HTTP on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.010 second response time
[22:28:38] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.040 second response time
[22:31:34] (PS16) Madhuvishy: jupyterhub: Add module to set up Jupyterhub for paws-internal [puppet] - https://gerrit.wikimedia.org/r/288086
[22:31:48] PROBLEM - parsoid on ruthenium is CRITICAL: connect to address 10.64.16.151 and port 8142: Connection refused
[22:40:08] PROBLEM - thumbor@8828 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8828 is inactive
[22:40:08] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[22:44:08] RECOVERY - thumbor@8828 service on thumbor1002 is OK: OK - thumbor@8828 is active
[22:45:48] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.086 second response time
[23:01:38] PROBLEM - thumbor@8812 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8812 is inactive
[23:03:13] (PS1) Yurik: Removed unused wmgUseGraphWithNamespace support [mediawiki-config] - https://gerrit.wikimedia.org/r/318862
[23:08:08] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[23:11:48] PROBLEM - thumbor@8838 service on thumbor1002 is CRITICAL: CRITICAL - Expecting active but unit thumbor@8838 is inactive
[23:14:18] RECOVERY - thumbor@8812 service on thumbor1002 is OK: OK - thumbor@8812 is active
[23:14:58] RECOVERY - thumbor@8838 service on thumbor1002 is OK: OK - thumbor@8838 is active
[23:49:28] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]