[00:00:05] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160108T0000). Please do the needful.
[00:00:20] nothing to swat
[00:01:04] Swat all the things!
[00:01:18] that's next week ostriches
[00:01:43] first three swats of next week are full
[00:02:05] Heh
[00:03:34] (PS3) Yuvipanda: extdist: Unbreak [puppet] - https://gerrit.wikimedia.org/r/262921 (https://phabricator.wikimedia.org/T123090) (owner: Legoktm)
[00:03:45] (CR) Yuvipanda: [C: 2 V: 2] extdist: Unbreak [puppet] - https://gerrit.wikimedia.org/r/262921 (https://phabricator.wikimedia.org/T123090) (owner: Legoktm)
[00:04:16] thanks YuviPanda
[00:04:42] operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1921091 (ArielGlenn) See also T123094 on replacing/upgrading the dataset servers, as they are out of warranty.
[00:05:13] operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#1921093 (ArielGlenn) a:ArielGlenn
[00:05:23] (PS4) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[00:05:57] operations, Dumps-Generation, hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1921095 (ArielGlenn) See also T123094 (replacing dataset servers since they are out of warranty).
[00:24:12] (PS1) Gergő Tisza: Sentry: really create group [puppet] - https://gerrit.wikimedia.org/r/263019 (https://phabricator.wikimedia.org/T85239)
[00:28:55] Ops-Access-Requests, operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1921145 (elukey) Added my public key to https://office.wikimedia.org/wiki/User:LToscano_(WMF) Thanks a lot!
[00:36:11] (CR) Alexandros Kosiaris: [C: 2] network: split frack into its proper subnets [puppet] - https://gerrit.wikimedia.org/r/260924 (owner: Faidon Liambotis)
[00:36:29] (PS3) Alexandros Kosiaris: network: split frack into its proper subnets [puppet] - https://gerrit.wikimedia.org/r/260924 (owner: Faidon Liambotis)
[00:37:25] (CR) Alexandros Kosiaris: [V: 2] network: split frack into its proper subnets [puppet] - https://gerrit.wikimedia.org/r/260924 (owner: Faidon Liambotis)
[00:47:53] (CR) Alexandros Kosiaris: [C: 2] network: add sandbox "realm" [puppet] - https://gerrit.wikimedia.org/r/260925 (owner: Faidon Liambotis)
[00:48:00] (PS3) Alexandros Kosiaris: network: add sandbox "realm" [puppet] - https://gerrit.wikimedia.org/r/260925 (owner: Faidon Liambotis)
[00:48:04] (CR) Alexandros Kosiaris: [V: 2] network: add sandbox "realm" [puppet] - https://gerrit.wikimedia.org/r/260925 (owner: Faidon Liambotis)
[00:50:01] (CR) Mobrovac: "LGTM if we settle on managing the config file set-up from within puppet itself on the first run." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/262742 (owner: Alexandros Kosiaris)
[00:55:02] (PS1) Ema: etcd.py: remove unused local variable 'e' [debs/pybal] - https://gerrit.wikimedia.org/r/263022
[01:11:55] (PS1) Gergő Tisza: logstash: add sentry output plugin [puppet] - https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239)
[01:12:45] (CR) jenkins-bot: [V: -1] logstash: add sentry output plugin [puppet] - https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: Gergő Tisza)
[01:22:21] (PS2) Gergő Tisza: logstash: add sentry output plugin [puppet] - https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239)
[01:50:51] (PS1) Gergő Tisza: Improve sentry plugin [software/logstash/plugins] - https://gerrit.wikimedia.org/r/263027 (https://phabricator.wikimedia.org/T85239)
[02:24:46] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 10m 15s)
[02:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 8 02:31:46 UTC 2016 (duration 7m 0s)
[02:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:36] (PS2) Gergő Tisza: Improve sentry plugin [software/logstash/plugins] - https://gerrit.wikimedia.org/r/263027 (https://phabricator.wikimedia.org/T85239)
[03:35:48] (PS3) Gergő Tisza: [WIP] logstash: send errors to sentry [puppet] - https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239)
[06:09:47] (PS5) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[06:10:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[06:14:53] (PS6) Yuvipanda: Replaced all spacing with tab [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[06:16:55] YuviPanda: ping
[06:24:23] (PS7) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[06:31:45] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:05] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:15] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:34] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:55:44] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:56:34] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:35] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:34] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:00:52] (CR) Hashar: [C: 1] "Gergo has cherry picked the patch on the beta cluster puppet master." [puppet] - https://gerrit.wikimedia.org/r/263012 (https://phabricator.wikimedia.org/T85239) (owner: Gergő Tisza)
[08:06:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:06:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:08:24] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:12:26] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:13:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:13:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:35:44] (CR) Mavrikant: [C: 1] Changed user group rights at trwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/261869 (https://phabricator.wikimedia.org/T122710) (owner: Luke081515)
[10:04:51] operations, Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1921459 (Kelson) @Faidon No, we are anonymous but mwoffliner (command line tool) forces to put an email address you can retrieve in the web client user-agent.
[10:35:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 809
[10:40:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1533802 Threads: 2 Questions: 41983257 Slow queries: 16720 Opens: 59902 Flush tables: 2 Open tables: 416 Queries per second avg: 27.372 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:03:39] (PS6) Hashar: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887)
[11:04:53] (CR) jenkins-bot: [V: -1] tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[11:06:19] (CR) Hashar: "Done in PS7" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[11:06:29] (PS7) Hashar: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887)
[11:07:22] (CR) jenkins-bot: [V: -1] tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: Hashar)
[11:40:32] (PS1) Thiemo Mättig (WMDE): Basic "Identifiers" statement section config for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112)
[11:58:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 207, down: 0, dormant: 0, excluded: 0, unused: 0
[13:00:45] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[13:02:55] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[13:12:45] (PS3) Mdann52: Add http://webapi.aucklandmuseum.com/ to $wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995)
[13:13:05] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:28:49] (PS1) Mdann52: Config canges to wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476)
[13:29:33] (PS2) Mdann52: Configuration changes to wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476)
[13:38:14] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:20:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0]
[14:22:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[14:33:34] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.00% of data above the critical threshold [100000000.0]
[14:54:32] (PS3) Luke081515: Configuration changes to wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: Mdann52)
[14:55:00] (CR) Luke081515: [C: 1] Configuration changes to wuu.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/263051 (https://phabricator.wikimedia.org/T122476) (owner: Mdann52)
[14:56:15] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[16:00:35] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: puppet fail
[16:15:05] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:29:44] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:38:54] (CR) Jforrester: "Should we be using GND so prominently? It's deprecated and widely seen as a failure…" [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[16:40:04] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[16:54:05] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[17:02:55] (CR) Thiemo Mättig (WMDE): "Not sure what you are referring to. It's used in millions of Wikipedia articles as one of the most critical and relevant identifiers, link" [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:06:45] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[17:07:15] (CR) Thiemo Mättig (WMDE): [C: -1] "Oh, I just realized we can use the "external-id" data type right away, see I00c38a5." [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:08:35] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:08:45] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:09:05] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:09:06] (CR) Jforrester: "We don't want to "lower" their visibility, we want them deleted: https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Migrating_awa" [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:09:25] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:10:35] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures
[17:10:44] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[17:11:04] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed
[17:11:24] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up
[17:11:28] (PS2) Thiemo Mättig (WMDE): Basic "Identifiers" statement section config for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112)
[17:15:32] (CR) Nemo bis: Basic "Identifiers" statement section config for Wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:16:25] (CR) Thiemo Mättig (WMDE): "Who is "we"? I'm aware there is a controversy going on about GND. Personally I have no idea why people are fighting one of the most useful" [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:16:45] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 42 failures
[17:17:22] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[17:17:31] (CR) Nemo bis: "I think James mixed P107 (which the linked RfC was about) with P227 (which this patch referenced)." [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:19:21] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 976101 bytes in 5.577 second response time
[17:19:24] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[17:21:36] (CR) Thiemo Mättig (WMDE): [C: -1] "The patch, as it is now, can be default configuration. Please do not merge this until we decided how to proceed." [mediawiki-config] - https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: Thiemo Mättig (WMDE))
[17:32:05] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[17:33:45] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:53:36] (CR) DCausse: [C: 1] [test only] Stricter avro schema tests [mediawiki-config] - https://gerrit.wikimedia.org/r/261296 (owner: EBernhardson)
[19:06:35] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[19:10:48] (PS2) Dzahn: tor: set family config option [puppet] - https://gerrit.wikimedia.org/r/260185
[19:16:12] Ops-Access-Requests, operations: Access for new Analytics Opsen: Luca Toscano - https://phabricator.wikimedia.org/T122925#1921896 (RobH) Actually, this includes a number of sudo level access, so it cannot just be the 3 day wait, but requires ops meeting review. As such, this is going to be stalled until...
[19:22:24] (PS1) RobH: setup shell/sudo access for new employee Luca Toscano [puppet] - https://gerrit.wikimedia.org/r/263078 (https://phabricator.wikimedia.org/T122925)
[19:30:04] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: puppet fail
[19:36:38] operations, Gerrit, hardware-requests: Need spare server to upgrade/migrate gerrit - https://phabricator.wikimedia.org/T123132#1921926 (demon) NEW
[19:39:08] (CR) Dzahn: [C: 2] tor: set family config option [puppet] - https://gerrit.wikimedia.org/r/260185 (owner: Dzahn)
[19:53:17] (CR) Andrew Bogott: "@Tim I would've thought not, but Yuvi says yes -- and my test case worked, at least." [puppet] - https://gerrit.wikimedia.org/r/262856 (https://phabricator.wikimedia.org/T121773) (owner: Andrew Bogott)
[20:18:05] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-tools was over 1 day, 1:00:00 ago
[20:18:05] operations, Traffic, HTTPS: Invalid web certificate on status.wikimedia.org - https://phabricator.wikimedia.org/T123135#1921978 (Josve05a) NEW
[20:18:19] operations, Traffic, HTTPS: Invalid web certificate on status.wikimedia.org - https://phabricator.wikimedia.org/T123135#1921985 (Josve05a)
[20:18:20] operations, Traffic, HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1921986 (Josve05a)
[20:18:39] operations, Traffic, HTTPS: Invalid web certificate on status.wikimedia.org - https://phabricator.wikimedia.org/T123135#1921987 (Legoktm)
[20:18:41] operations, Traffic, HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1921988 (Legoktm)
[20:19:22] (PS8) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[20:21:46] operations: add user jrabbah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922003 (Aklapper) Ops team might wonder who jrabbah is, and I don't see a similar name on https://wikimediafoundation.org/wiki/Staff_and_contractors ...?
[20:23:45] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run for unit replicate-maps was over 1 day, 1:00:00 ago
[20:24:57] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1922011 (Aklapper) >>! In T122927#1920820, @Dzahn wrote: > Maybe the jimmy@ alias can be moved over to Google and OIT ? @cajoel ^ @JKrauska : ?
[20:26:03] operations: add user jrabbah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922016 (Southparkfan) @Aklapper: I guess it's Jack Rabah (https://wikimediafoundation.org/wiki/Staff_and_contractors#Partnerships_and_Wikipedia_Zero)? Small spelling mistake in that case.
[20:27:35] operations: add user jrabbah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922019 (Krenair) Yes, email address that I can see in gmail does not have two 'b's
[20:27:49] operations: add user jrabah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922020 (Krenair)
[20:30:35] PROBLEM - RAID on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:30:35] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:31:05] PROBLEM - configured eth on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:34:55] PROBLEM - dhclient process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:35:24] PROBLEM - nutcracker port on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:25] PROBLEM - Disk space on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:34] PROBLEM - salt-minion processes on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:36:36] PROBLEM - DPKG on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:38:05] PROBLEM - nutcracker process on mw1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:38:59] operations: add user jrabah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922069 (Southparkfan)
[20:42:34] RECOVERY - Disk space on mw1008 is OK: DISK OK
[20:42:35] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:42:45] RECOVERY - DPKG on mw1008 is OK: All packages OK
[20:43:04] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[20:43:04] RECOVERY - dhclient process on mw1008 is OK: PROCS OK: 0 processes with command name dhclient
[20:43:04] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed
[20:43:25] RECOVERY - configured eth on mw1008 is OK: OK - interfaces up
[20:43:26] RECOVERY - nutcracker port on mw1008 is OK: TCP OK - 0.000 second response time on port 11212
[20:44:06] RECOVERY - nutcracker process on mw1008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[20:45:14] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[20:48:24] operations: add user jrabah@ to strategicpartnerships@ - https://phabricator.wikimedia.org/T122989#1922075 (eliza) Thank you - yes you are correct - it is jrabah@ (single b)
[20:49:55] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:49:55] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:51:45] RECOVERY - DPKG on mw1013 is OK: All packages OK
[20:51:45] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[20:55:16] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1922078 (eliza) Hello Everyone, I'm also hesitant on doing this at the moment - due to the Wikipedia 15 celebrations. But will suggest with Caitln to do this afterwards if that works for everyone. Eliza
[20:56:52] looks like mw1008 and mw1013 completely went down due to OOM?
[21:09:25] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:13:25] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 42 minutes ago with 0 failures
[21:18:24] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:19:05] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:19:54] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:19:55] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:20:04] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:20:46] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:21:14] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:21:25] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:22:54] RECOVERY - Disk space on mw1013 is OK: DISK OK
[21:23:25] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:28:50] (PS9) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[21:29:04] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:29:15] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:29:15] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:29:44] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:32:39] (PS10) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[21:34:25] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[21:34:34] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up
[21:34:35] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212
[21:35:04] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed
[21:35:15] RECOVERY - Disk space on mw1013 is OK: DISK OK
[21:35:25] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[21:35:26] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[21:35:44] RECOVERY - DPKG on mw1013 is OK: All packages OK
[21:35:45] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[21:35:55] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[21:47:35] PROBLEM - RAID on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:49:05] PROBLEM - puppet last run on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:49:05] PROBLEM - configured eth on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:49:15] PROBLEM - nutcracker port on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:49:56] PROBLEM - Disk space on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:50:25] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:50:25] PROBLEM - DPKG on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:50:35] PROBLEM - salt-minion processes on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:14] PROBLEM - dhclient process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:15] PROBLEM - nutcracker process on mw1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:54:43] (PS1) Papaul: partman: Replaced Tab with spoces [puppet] - https://gerrit.wikimedia.org/r/263144
[22:10:45] RECOVERY - Disk space on mw1013 is OK: DISK OK
[22:10:55] RECOVERY - dhclient process on mw1013 is OK: PROCS OK: 0 processes with command name dhclient
[22:10:55] RECOVERY - nutcracker process on mw1013 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[22:11:14] RECOVERY - DPKG on mw1013 is OK: All packages OK
[22:11:15] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[22:11:24] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:11:55] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[22:11:55] RECOVERY - configured eth on mw1013 is OK: OK - interfaces up
[22:12:14] RECOVERY - nutcracker port on mw1013 is OK: TCP OK - 0.000 second response time on port 11212
[22:12:35] RECOVERY - RAID on mw1013 is OK: OK: no RAID installed
[22:14:16] operations: Adding ltoscano@wikimedia.org to the analytics-alert mailing list - https://phabricator.wikimedia.org/T123141#1922155 (elukey) NEW
[22:21:33] (CR) Dzahn: "now shows family link on https://atlas.torproject.org/#details/DB19E709C9EDB903F75F2E6CA95C84D637B62A02" [puppet] - https://gerrit.wikimedia.org/r/260185 (owner: Dzahn)
[22:25:36] (CR) Dzahn: [C: 1] Use a more useful error message when DB connection fails [software/dbtree] - https://gerrit.wikimedia.org/r/251791 (owner: Alex Monk)
[22:35:04] operations: add slien to jimmy alias - https://phabricator.wikimedia.org/T122927#1922212 (Dzahn) @eliza thanks, that sounds good
[22:55:55] (PS1) Alexandros Kosiaris: diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149
[23:06:04] (PS1) Legoktm: Add my (legoktm) new yubikey-based ssh key [puppet] - https://gerrit.wikimedia.org/r/263151
[23:07:05] (PS2) Alexandros Kosiaris: diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149
[23:09:33] (PS3) Alexandros Kosiaris: diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149
[23:10:32] (CR) Alexandros Kosiaris: [C: 2] diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149 (owner: Alexandros Kosiaris)
[23:10:37] (PS4) Alexandros Kosiaris: diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149
[23:11:39] akosiaris: wanna add my new ssh key? https://gerrit.wikimedia.org/r/#/c/263151/ :)
[23:12:53] legoktm: sure
[23:13:06] (CR) Alexandros Kosiaris: [C: 2] Add my (legoktm) new yubikey-based ssh key [puppet] - https://gerrit.wikimedia.org/r/263151 (owner: Legoktm)
[23:16:13] (PS5) Alexandros Kosiaris: diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149
[23:16:17] (CR) Alexandros Kosiaris: [V: 2] diamond: Introduce an etherpad plugin [puppet] - https://gerrit.wikimedia.org/r/263149 (owner: Alexandros Kosiaris)
[23:26:49] operations: Adding ltoscano@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T123141#1922253 (Aklapper)
[23:28:22] (PS11) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[23:30:22] (PS12) Papaul: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879)
[23:38:12] (PS13) RobH: Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[23:38:38] (CR) RobH: [C: 2] Replaced all spacing with tab Bug:T121879 [puppet] - https://gerrit.wikimedia.org/r/262998 (https://phabricator.wikimedia.org/T121879) (owner: Papaul)
[23:42:42] (PS1) Alexandros Kosiaris: etherpad diamond collector: Fix a typo [puppet] - https://gerrit.wikimedia.org/r/263154
[23:43:01] (CR) Alexandros Kosiaris: [C: 2] etherpad diamond collector: Fix a typo [puppet] - https://gerrit.wikimedia.org/r/263154 (owner: Alexandros Kosiaris)
[23:43:17] (PS2) Alexandros Kosiaris: etherpad diamond collector: Fix a typo [puppet] - https://gerrit.wikimedia.org/r/263154
[23:43:23] (CR) Alexandros Kosiaris: [V: 2] etherpad diamond collector: Fix a typo [puppet] - https://gerrit.wikimedia.org/r/263154 (owner: Alexandros Kosiaris)
[23:49:52] !log stalled puppet on carbon for now, messing with partman files
[23:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, RobH