[00:23:08] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [00:24:17] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [00:29:37] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [00:30:17] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042 [01:19:33] 6Operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2138089 (10Tgr) >>! In T52864#2137966, @RobLa-WMF wrote: > @JanZerebecki - I don't have authority to resource this. I was hoping @mark or someone from #operations would respond, but I believe that s... [01:27:34] 6Operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#553889 (10ori) >>! In T52864#2137966, @RobLa-WMF wrote: > I was hoping @mark or someone from #operations would respond @faidon did, in T52864#954874 above. 
[02:00:46] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [02:01:07] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [02:02:29] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [02:02:57] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042 [02:24:50] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 57s) [02:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 21 02:33:31 UTC 2016 (duration 8m 41s) [02:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:22] (03PS2) 10Sabya: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 [02:47:00] (03CR) 10jenkins-bot: [V: 04-1] Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [02:55:29] 6Operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2138129 (10RobLa-WMF) >>! In T52864#2138089, @Tgr wrote: >>>! In T52864#2137966, @RobLa-WMF wrote: >> @JanZerebecki - I don't have authority to resource this. I was hoping @mark or someone from #ope... [03:00:29] (03PS3) 10Sabya: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 [03:09:32] (03PS1) 10Ori.livneh: Add ten additional countries to NavTiming [puppet] - 10https://gerrit.wikimedia.org/r/278701 [03:40:37] PROBLEM - RAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [03:52:58] Hello. Wikipedia seems down for me. The iOS app isn't working either. Error: Host with specified name could not be found. 
[03:58:38] (03PS1) 10Yuvipanda: labs: Add support for custom cnames in labs recursor [puppet] - 10https://gerrit.wikimedia.org/r/278705 (https://phabricator.wikimedia.org/T118758) [04:10:57] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:25:14] (03CR) 10Ori.livneh: "Um, for each continent except Antarctica and Oceania, that is." [puppet] - 10https://gerrit.wikimedia.org/r/278701 (owner: 10Ori.livneh) [04:26:44] Niharika: can you run traceroute to wikipedia.org? (If you're using Windows, it's "tracert") [04:27:43] Actually, that may not work, if you're not able to resolve the name [04:27:44] ori: Okay, let me get on my laptop and try that. [04:29:40] If you get an "unknown host" error, try running 'nslookup en.wikipedia.org' and make note of the "Server:" line [04:30:43] ori: Never mind, it seems to be back up now. [04:30:49] I fixed it! \o/ [04:30:55] :D [04:37:46] Niharika: when I was on Airtel, their DNS servers would fuck up like this now and then, with weird cache issues [04:37:56] Niharika: I switched my router to use Google DNS, and the problems went away [04:39:03] yuvipanda: That's probably it. I was on Google DNS but the stupid wifi at Bentley (All-hands) wouldn't let me use any other DNS except their own, and I forgot to switch back to Google ones when I got home. [04:46:35] Niharika: ah! 
[04:59:06] captive portals that hijack dns are the worst [05:22:08] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 88595.00 seconds [05:35:59] (03PS1) 10Ori.livneh: Add Australia to NavTiming countries [puppet] - 10https://gerrit.wikimedia.org/r/278706 [05:43:46] (03CR) 10Ori.livneh: [C: 032] Add Australia to NavTiming countries [puppet] - 10https://gerrit.wikimedia.org/r/278706 (owner: 10Ori.livneh) [06:12:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [06:22:37] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [06:30:47] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:18] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:28] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:57] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:56] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:47] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:56:47] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:48] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 49 seconds 
ago with 0 failures [06:57:07] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:57:26] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:57:37] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:46] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 7 failures [06:57:47] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:57] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:07] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [07:23:48] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:59:05] 6Operations, 10ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2138224 (10jcrespo) [08:00:13] ACKNOWLEDGEMENT - RAID on db1067 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T130517 [08:57:18] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 69 failures [09:09:13] !log restarted hhvm on mw1116 [09:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:14:58] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:26:04] !log Altering change_tag engine to InnoDB on db1069:3313 [09:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:29:14] (03PS4) 10Mobrovac: Introducing changeprop role and puppet module [puppet] - 
10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) [09:30:40] (03CR) 10jenkins-bot: [V: 04-1] Introducing changeprop role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) (owner: 10Mobrovac) [09:39:08] I think change_tag was the main cause of lag on s3 for labs, but we will see if it pays off [09:42:09] (03PS5) 10Mobrovac: Introducing changeprop role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) [09:48:59] (03PS3) 10Hashar: hiera_lookup: support 'labs' realm [puppet] - 10https://gerrit.wikimedia.org/r/276345 (https://phabricator.wikimedia.org/T129092) [09:49:10] (03PS1) 10Mobrovac: Citoid: Switch to the Scap3 deployment method [puppet] - 10https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337) [09:49:50] (03CR) 10Elukey: [C: 032] "Merging the change since it will be a bit difficult to test this code review on the main puppet repo. 
The next step is to file a code revi" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [09:50:25] derp [09:51:23] I always hate my nick when grrrit cuts the text at revi(ew) [09:51:48] (03PS4) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) [09:51:59] (03CR) 10Hashar: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [10:05:57] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 28 failures [10:08:26] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: Puppet has 77 failures [10:14:23] (03PS1) 10Elukey: Update Analytics cdh submodule after https://gerrit.wikimedia.org/r/#/c/277984/ [puppet] - 10https://gerrit.wikimedia.org/r/278713 (https://phabricator.wikimedia.org/T129838) [10:19:37] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2138340 (10hashar) >>! In T114249#2106764, @ArielGlenn wrote: > Don't despair. I have still on my roadmap to l... [10:24:39] !log Altering user_properties engine to InnoDB on db1069:3313 [10:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:46] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:52:08] !log Live hacked puppet compiler on compiler02.puppet3-diffs.eqiad.wmflabs to debug it not processing submodules. 
Reinstalled it from the last tag in the process [10:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:48] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 73 failures [11:32:17] PROBLEM - HHVM rendering on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.021 second response time [11:32:58] PROBLEM - Apache HTTP on mw1133 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [11:34:07] RECOVERY - HHVM rendering on mw1133 is OK: HTTP OK: HTTP/1.1 200 OK - 71682 bytes in 8.751 second response time [11:34:46] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.060 second response time [11:36:22] (03CR) 10Mobrovac: Introducing changeprop role and puppet module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) (owner: 10Mobrovac) [11:55:56] PROBLEM - HHVM rendering on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:57:16] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:01:48] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [12:02:55] mw1121: Mar 21 11:54:18 mw1121 kernel: [428912.210401] Out of memory: Kill process 23236 (hhvm) score 951 or sacrifice child [12:04:18] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:30:57] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2138475 (10BBlack) From a naive POV based on the screenshots alone: they're using an outdated set of Root certificates, inc... 
[12:52:06] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [13:06:27] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:09:43] (03PS3) 10BBlack: move most of esams to standard layout [dns] - 10https://gerrit.wikimedia.org/r/270285 [13:23:39] (03PS1) 10BBlack: remove esams ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278721 [13:23:41] (03PS1) 10BBlack: remove corp ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278722 [13:23:43] (03PS1) 10BBlack: remove redundant wikimedia.org. trailers [dns] - 10https://gerrit.wikimedia.org/r/278723 [13:24:23] (03CR) 10jenkins-bot: [V: 04-1] remove esams ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278721 (owner: 10BBlack) [13:29:10] (03PS2) 10BBlack: remove esams ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278721 [13:29:12] (03PS2) 10BBlack: remove redundant wikimedia.org. 
trailers [dns] - 10https://gerrit.wikimedia.org/r/278723 [13:29:14] (03PS2) 10BBlack: remove corp ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278722 [13:33:02] !log restbase deploy start of 26f9e90 on canary restbase1003 [13:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:26] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:47] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [14:00:06] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.034 second response time [14:00:08] RECOVERY - Restbase root url on restbase1012 is OK: HTTP OK: HTTP/1.1 200 - 15253 bytes in 0.045 second response time [14:00:48] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [14:01:28] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [14:04:21] (03CR) 10Ottomata: [C: 031] "If you feel good about it, proceed!" [puppet] - 10https://gerrit.wikimedia.org/r/278713 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [14:05:25] !log restbase deploy end of 26f9e90 [14:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:21] (03CR) 10Ottomata: "Talked with Marko a bit about this in IRC." 
(031 comment) [puppet/kafka] - 10https://gerrit.wikimedia.org/r/278329 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:23:38] !log restarting labsdb1001 mysql [14:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:16] !log altering kafka topics webrequest_text and webrequest_upload, increasing each from 12 partitions to 24 partitions [14:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:07] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [14:28:24] (03PS2) 10Ottomata: Increase number of map tasks for camus webrequest to 72 [puppet] - 10https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) [14:28:36] (03CR) 10Ottomata: [C: 032 V: 032] Increase number of map tasks for camus webrequest to 72 [puppet] - 10https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) (owner: 10Ottomata) [14:47:47] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [15:00:05] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T1500). [15:00:05] MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:07] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures [15:00:15] hello. [15:01:28] MatmaRex: Hiya, I can SWAT. [15:01:33] (03CR) 10Nuria: "I am all for this change but do not know enough of mediawiki conventions to merge. 
+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [15:05:07] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures [15:10:07] PROBLEM - check_puppetrun on betelgeuse is CRITICAL: CRITICAL: Puppet has 61 failures [15:11:38] that took a while to merge. [15:13:43] !log thcipriani@tin Synchronized php-1.27.0-wmf.17/includes/upload/UploadBase.php: SWAT: UploadBase: Set mFileSize, if given, even if mTempPath is unknown [[gerrit:278724]] (duration: 00m 30s) [15:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:49] ^ MatmaRex check please [15:14:00] yeah, core changes ain't quick for jenkins [15:15:07] RECOVERY - check_puppetrun on betelgeuse is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:15:29] thcipriani: thanks. i don't want to upload files in production to check this, but we have logs for the errors this fixes and i'll watch them. [15:15:39] MatmaRex: ack. Thanks. [15:41:55] thcipriani: Are you done with SWAT? 
[15:42:07] hoo: yes [15:42:20] (03PS1) 10Hoo man: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278736 [15:43:09] (03CR) 10Hoo man: [C: 032] Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278736 (owner: 10Hoo man) [15:43:35] (03Merged) 10jenkins-bot: Bump $wgCacheEpoch on Wikidata after Property conversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278736 (owner: 10Hoo man) [15:44:31] !log hoo@tin Synchronized wmf-config/Wikibase.php: Bump $wgCacheEpoch on Wikidata after Property conversions (duration: 00m 28s) [15:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:03] 7Blocked-on-Operations, 6Operations, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2138908 (10GWicke) [15:55:10] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores, 13Patch-For-Review: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#2138942 (10Halfak) 5Open>3Resolved a:3Halfak [16:17:06] (03PS3) 10Giuseppe Lavagetto: Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) [16:21:46] (03Abandoned) 10Giuseppe Lavagetto: Adding more unit tests [software/conftool] - 10https://gerrit.wikimedia.org/r/278550 (owner: 10Giuseppe Lavagetto) [16:22:11] (03Abandoned) 10Giuseppe Lavagetto: Print out the tags any conftool result line is referring to [software/conftool] - 10https://gerrit.wikimedia.org/r/278551 (https://phabricator.wikimedia.org/T128199) (owner: 10Giuseppe Lavagetto) [16:26:06] 6Operations, 10ops-eqiad, 10RESTBase-Cassandra: restbase1007.eqiad.wmnet CPU temperature? 
- https://phabricator.wikimedia.org/T130370#2134035 (10GWicke) This is one of the three boxes (restbase1007-1009) where a second CPU was installed later. [16:47:17] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [16:47:26] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:52:32] on it ^ [16:54:27] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [16:56:07] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on port 9042 [17:00:07] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [17:05:16] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [17:10:06] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 205 seconds ago with 0 failures [17:29:06] (03PS1) 10Elukey: HDFS Namenode automatic failover support - bug fixes. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838) [17:45:37] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 1140 bytes in 0.038 second response time [17:49:07] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2493 bytes in 0.120 second response time [17:54:56] (03PS1) 10Elukey: Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344) [18:00:12] 6Operations, 10Ops-Access-Requests, 6Discovery, 10Maps, 13Patch-For-Review: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2135290 (10akosiaris) So, this constitutes a sudo request, so per policy we need to get this approved in the ops meeting. 
FWIW, I support this [18:15:39] if Ops are around, I have some simple patches for review: https://gerrit.wikimedia.org/r/278270 [18:15:46] https://gerrit.wikimedia.org/r/278271 [18:15:49] (03CR) 10Ottomata: HDFS Namenode automatic failover support - bug fixes. (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [18:20:32] (03CR) 10Ottomata: [C: 031] Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [18:42:30] (03PS1) 10Ottomata: Add DC named topics to event bus topic config [puppet] - 10https://gerrit.wikimedia.org/r/278752 (https://phabricator.wikimedia.org/T127718) [18:43:30] (03PS2) 10Alexandros Kosiaris: Citoid: Switch to the Scap3 deployment method [puppet] - 10https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337) (owner: 10Mobrovac) [18:54:21] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Citoid: Switch to the Scap3 deployment method [puppet] - 10https://gerrit.wikimedia.org/r/278710 (https://phabricator.wikimedia.org/T116337) (owner: 10Mobrovac) [18:58:20] 6Operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#2139275 (10AdHuikeshoven) @RobLa-WMF , thanks for the kind words. The status of Discourse is a pilot a test and generates feedback about what people like and what people don't like. There are some st... 
[18:59:57] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [19:00:17] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [19:02:16] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [19:03:37] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042 [19:10:57] 6Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#2139292 (10AdHuikeshoven) [19:12:29] (03CR) 10Alexandros Kosiaris: [C: 032] Flake8 for labstore and wdqs [puppet] - 10https://gerrit.wikimedia.org/r/278270 (owner: 10Ladsgroup) [19:12:34] (03PS3) 10Alexandros Kosiaris: Flake8 for labstore and wdqs [puppet] - 10https://gerrit.wikimedia.org/r/278270 (owner: 10Ladsgroup) [19:14:52] (03CR) 10Alexandros Kosiaris: [V: 032] Flake8 for labstore and wdqs [puppet] - 10https://gerrit.wikimedia.org/r/278270 (owner: 10Ladsgroup) [19:17:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] "That's a stub class (apart from the base requirement). Let's populate it with something actually doing something useful :-)" [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) (owner: 10Halfak) [19:18:13] 6Operations, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#2139305 (10AdHuikeshoven) [19:18:17] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 56 failures [19:19:00] (03CR) 10Alexandros Kosiaris: "Hm... that's a 753 line patch. I know it's supposed to be NOOP, but got to figure out how to test it before breaking someone's workflow. 
T" [puppet] - 10https://gerrit.wikimedia.org/r/278271 (owner: 10Ladsgroup) [19:31:44] (03CR) 10Ladsgroup: "and that's even the first pass, I will make several others just for LDAP. These changes are only cosmetic ones and they don't change outsi" [puppet] - 10https://gerrit.wikimedia.org/r/278271 (owner: 10Ladsgroup) [19:32:06] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:32:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:36:26] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:36:46] yuvipanda: https://gerrit.wikimedia.org/r/#/c/197409/ what have I missed ? [19:37:13] aka: why nodes/labs/integration under the top level puppet repo hierarchy ? [19:37:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:43:22] (03PS1) 10Ladsgroup: Flake8 for apt [puppet] - 10https://gerrit.wikimedia.org/r/278753 [19:44:36] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Flake8 for apt [puppet] - 10https://gerrit.wikimedia.org/r/278753 (owner: 10Ladsgroup) [19:45:20] akosiaris: if you are in pep8/flake8 mood, I had a pending patch to switch the puppet repo to use tox to run pep8 :) [19:46:06] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:16] I am just in a react mood while handling some ORES redis things, not really in a flake8 mood :-( [19:46:25] :D [19:46:57] PROBLEM - HHVM rendering on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:47] PROBLEM - nutcracker process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:56] PROBLEM - nutcracker port on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:48:06] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:08] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:16] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:17] PROBLEM - DPKG on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:38] PROBLEM - dhclient process on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:47] PROBLEM - HHVM processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:56] PROBLEM - configured eth on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:16] PROBLEM - SSH on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:17] PROBLEM - Disk space on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:46] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:49:48] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 0 % full [19:49:56] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [19:55:07] PROBLEM - salt-minion processes on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:08] PROBLEM - Check size of conntrack table on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:55:17] PROBLEM - RAID on mw1142 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:52] !log powercycle mw1142, console available but not ever prompting for the root password, stuck at username [19:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T2000). 
[20:00:57] RECOVERY - dhclient process on mw1142 is OK: PROCS OK: 0 processes with command name dhclient
[20:01:07] RECOVERY - HHVM processes on mw1142 is OK: PROCS OK: 6 processes with command name hhvm
[20:01:07] RECOVERY - HHVM rendering on mw1142 is OK: HTTP OK: HTTP/1.1 200 OK - 69920 bytes in 1.198 second response time
[20:01:16] RECOVERY - configured eth on mw1142 is OK: OK - interfaces up
[20:01:36] RECOVERY - SSH on mw1142 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[20:01:46] RECOVERY - Disk space on mw1142 is OK: DISK OK
[20:01:57] RECOVERY - nutcracker process on mw1142 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[20:02:06] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.106 second response time
[20:02:07] RECOVERY - nutcracker port on mw1142 is OK: TCP OK - 0.000 second response time on port 11212
[20:02:16] RECOVERY - salt-minion processes on mw1142 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:02:17] RECOVERY - Check size of conntrack table on mw1142 is OK: OK: nf_conntrack is 9 % full
[20:02:27] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed
[20:02:28] RECOVERY - DPKG on mw1142 is OK: All packages OK
[20:03:48] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 1140 bytes in 0.044 second response time
[20:04:30] (PS1) Ladsgroup: Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754
[20:04:37] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:05:37] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2493 bytes in 0.110 second response time
[20:06:13] (PS1) Ladsgroup: Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755
[20:12:27] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[20:12:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[20:13:07] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[20:23:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:30:06] PROBLEM - DPKG on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:07] PROBLEM - puppet last run on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:07] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:31:47] RECOVERY - DPKG on analytics1047 is OK: All packages OK
[20:31:48] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[20:31:48] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[20:32:37] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[20:33:47] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[20:36:47] (CR) Alexandros Kosiaris: [C: 2] Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:36:53] (PS2) Alexandros Kosiaris: Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:37:06] (CR) Alexandros Kosiaris: [C: 2] Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:37:09] (CR) Alexandros Kosiaris: [V: 2] Flake8 and fix bug in phabricator [puppet] - https://gerrit.wikimedia.org/r/278754 (owner: Ladsgroup)
[20:37:32] (PS2) Alexandros Kosiaris: Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:37:36] (CR) Alexandros Kosiaris: [V: 2] Flake8 for osm [puppet] - https://gerrit.wikimedia.org/r/278755 (owner: Ladsgroup)
[20:39:11] (PS1) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758
[20:39:13] (PS1) Alexandros Kosiaris: Add the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[20:46:42] akosiaris: thank you :)
[21:02:58] PROBLEM - YARN NodeManager Node-State on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:03:47] PROBLEM - RAID on analytics1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:07:57] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[21:08:58] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[21:09:57] RECOVERY - YARN NodeManager Node-State on analytics1047 is OK: OK: YARN NodeManager analytics1047.eqiad.wmnet:8041 Node-State: RUNNING
[21:10:37] RECOVERY - RAID on analytics1047 is OK: OK: optimal, 13 logical, 14 physical
[21:10:49] (CR) Alexandros Kosiaris: "Not against this per se, but what is the rationale behind this ?" [puppet] - https://gerrit.wikimedia.org/r/278318 (owner: Ori.livneh)
[21:32:47] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[21:33:46] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042
[22:09:45] (PS1) Ladsgroup: Flake8 on openstack, part I [puppet] - https://gerrit.wikimedia.org/r/278761
[22:12:28] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[22:13:27] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[22:13:47] (PS1) Ladsgroup: Flake8 for HHVM [puppet] - https://gerrit.wikimedia.org/r/278762
[22:15:10] (PS2) Ladsgroup: Flake8 for HHVM [puppet] - https://gerrit.wikimedia.org/r/278762
[22:25:07] (PS1) Ladsgroup: flake8 on icinga [puppet] - https://gerrit.wikimedia.org/r/278763
[22:26:27] (CR) Ori.livneh: "Still not sure I need it, to be honest. I was going to look into setting the 'Server: ' header to the app server hostname, instead of just" [puppet] - https://gerrit.wikimedia.org/r/278318 (owner: Ori.livneh)
[22:31:16] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[22:33:47] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.042 second response time on port 9042
[22:56:40] (PS3) Alexandros Kosiaris: stdlib: import deep_merge function [puppet] - https://gerrit.wikimedia.org/r/278241
[22:56:42] (PS2) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[22:56:44] (PS2) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[22:56:46] (PS1) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[22:58:03] (Abandoned) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278242 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[22:58:21] (Abandoned) Alexandros Kosiaris: ores: define slaveof as a parameter [puppet] - https://gerrit.wikimedia.org/r/278243 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[23:00:04] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160321T2300).
[23:03:30] no patches listed
[23:04:38] (PS3) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:15:55] (PS4) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:26:15] (PS2) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[23:26:17] (PS5) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[23:26:19] (PS3) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[23:28:16] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:40:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:46:02] (CR) Alexandros Kosiaris: [C: 2] flake8 on icinga [puppet] - https://gerrit.wikimedia.org/r/278763 (owner: Ladsgroup)
[23:47:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:51:57] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89413.00 seconds