[00:01:05] (PS4) Dzahn: put installserver::dhcp on install1001, install2001 [puppet] - https://gerrit.wikimedia.org/r/305431 (https://phabricator.wikimedia.org/T132757)
[00:07:13] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2569060 (Dzahn) install1001 and install2001 are now DHCP servers isc-dhcp-server got installed and /etc/dhcp/ got populated. the datacenter-ops group got access on them be...
[00:21:30] (PS2) RobH: robh on vacation next week, remove from paging [puppet] - https://gerrit.wikimedia.org/r/305734
[00:21:50] (CR) RobH: [C: 2] robh on vacation next week, remove from paging [puppet] - https://gerrit.wikimedia.org/r/305734 (owner: RobH)
[00:41:03] Operations, Ops-Access-Requests: Access needed for people.wikimedia.org for showcasing - https://phabricator.wikimedia.org/T143465#2569112 (Volker_E)
[01:04:54] (PS2) Mattflaschen: Set Flow as default for User talk on kabwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/305670 (https://phabricator.wikimedia.org/T140588)
[02:19:21] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:21:21] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[02:27:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[02:29:12] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.15) (duration: 09m 46s)
[02:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Aug 20 02:35:10 UTC 2016 (duration 5m 58s)
[02:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:59:21] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[05:07:18] Operations, Ops-Access-Requests: Access needed for people.wikimedia.org for showcasing - https://phabricator.wikimedia.org/T143465#2569157 (Dzahn) This request is reasonable to me, we talked. Since the latest changes this means it does not need any group membership, no sudo, no nothing. Just create the...
[06:26:22] RECOVERY - Disk space on scb1001 is OK: DISK OK
[06:42:52] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:08:42] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:37:45] (PS1) Alex Monk: deployment-prep: remove old deployment-fluorine hieradata [puppet] - https://gerrit.wikimedia.org/r/305765
[07:38:44] (PS1) Alex Monk: udp2log::instance: require psmisc command [puppet] - https://gerrit.wikimedia.org/r/305766
[07:39:49] (PS2) Alex Monk: udp2log::instance: require psmisc package for use of killall command [puppet] - https://gerrit.wikimedia.org/r/305766
[07:44:03] (PS1) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767
[07:45:17] (CR) jenkins-bot: [V: -1] Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk)
[07:48:52] (PS2) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767
[07:52:38] (PS1) Alex Monk: mw-log-cleanup: remove wfDebug files in deployment-prep every week [puppet] - https://gerrit.wikimedia.org/r/305768
[07:54:31] (CR) Alex Monk: "cherry-picked on deployment-puppetmaster" [puppet] - https://gerrit.wikimedia.org/r/305767 (owner: Alex Monk)
[07:57:04] (PS2) Alex Monk: mw-log-cleanup: remove wfDebug files in deployment-prep every week [puppet] - https://gerrit.wikimedia.org/r/305768
[07:57:06] (PS3) Alex Monk: Remove the hard-coded /a/mw-log references scattered around everywhere [puppet] - https://gerrit.wikimedia.org/r/305767
[08:00:08] * Krenair disappears again
[08:03:03] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#2569178 (AlexMonk-WMF)
[08:03:05] Operations, Patch-For-Review: Do not require people to be explicitly added to the bastiononly group - https://phabricator.wikimedia.org/T114161#2569177 (AlexMonk-WMF) Open>Resolved
[10:28:30] Hi, there are currently a lot of fatal errors caused by an issue with ProofreadPage.
[10:28:35] 653 error: Couldn't find constant ProofreadPage::AS_HOOK_ERROR in /srv/mediawiki/php-1.28.0-wmf.15/extensions/ProofreadPage/ProofreadPage.body.php on line
[10:28:38] 625
[10:29:06] A fix is available, but not yet reviewed or merged in master; that's tracked at https://phabricator.wikimedia.org/T143471
[11:12:42] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[11:15:07] For the ProofreadPage::AS_HOOK_ERROR error, a fix has been merged to master, and backported for wmf.15 as https://gerrit.wikimedia.org/r/#/c/305776/
[11:15:20] We're at 843 errors in the cluster.
[11:38:42] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:42:59] (PS1) Urbanecm: Enable transwiki upload for tcywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/305777 (https://phabricator.wikimedia.org/T143397)
[11:52:15] Follow-up for the ProofreadPage::AS_HOOK_ERROR error: Tpt asserted the issue shouldn't occur for human contributors, only for bots. So the advice is that we can wait for a regular SWAT window.
[11:52:30] Meanwhile, fatalmonitor error count is at 910.
[12:18:38] (PS1) Urbanecm: [cleanup] Remove old throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/305780
[12:26:02] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25)
[12:28:12] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[13:04:59] (PS1) Dereckson: Fix user namespaces on Slovak Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/305785 (https://phabricator.wikimedia.org/T143472)
[13:33:59] !log rolling, depooled restarts for varnish-frontends: all text, upload in esams, eqiad, codfw.
[13:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:55:13] PROBLEM - salt-minion processes on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:55:21] PROBLEM - Hadoop DataNode on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:55:43] PROBLEM - Hadoop NodeManager on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:56:13] PROBLEM - MegaRAID on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:56:41] PROBLEM - dhclient process on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:57:03] PROBLEM - YARN NodeManager Node-State on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:57:22] PROBLEM - puppet last run on analytics1048 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:01] RECOVERY - YARN NodeManager Node-State on analytics1048 is OK: OK: YARN NodeManager analytics1048.eqiad.wmnet:8041 Node-State: RUNNING
[13:59:22] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[13:59:22] RECOVERY - salt-minion processes on analytics1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:59:31] RECOVERY - Hadoop DataNode on analytics1048 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[13:59:53] RECOVERY - Hadoop NodeManager on analytics1048 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[14:00:31] RECOVERY - MegaRAID on analytics1048 is OK: OK: optimal, 13 logical, 14 physical
[14:01:01] RECOVERY - dhclient process on analytics1048 is OK: PROCS OK: 0 processes with command name dhclient
[14:48:41] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:15:02] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:03:12] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[16:11:12] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0]
[17:30:56] (PS1) Glaisher: Enable T143073 debug log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/305801 (https://phabricator.wikimedia.org/T143073)
[19:00:39] (PS1) Yuvipanda: [WIP] Introduce 'clush' module and toollabs role [puppet] - https://gerrit.wikimedia.org/r/305804
[19:06:11] Operations, MediaWiki-extensions-UniversalLanguageSelector, Traffic, Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2569690 (Nikerabbit) >>! In T143270#2567661, @BBlack wrote: > ULS already supports the freegeoip format? Yes as far as...
[19:10:09] Nikerabbit: https://logstash.wikimedia.org/goto/d473b641f2aeae7dda5eb47549d4b12b looks like those errors are gone
[19:14:33] AaronSchulz: same for pinglimiter
[19:15:10] AaronSchulz: frontend fixes were deployed a day or two later, that's when it all went quiet
[19:27:19] (PS2) Yuvipanda: [WIP] Introduce 'clush' module and toollabs role [puppet] - https://gerrit.wikimedia.org/r/305804
[19:44:37] (PS3) Paladox: Bring back ostriches (Chad) change with no "" [puppet] - https://gerrit.wikimedia.org/r/304977
[19:49:38] (PS3) Yuvipanda: [WIP] Introduce 'clush' module and toollabs role [puppet] - https://gerrit.wikimedia.org/r/305804
[19:52:37] (PS4) Yuvipanda: [WIP] Introduce 'clush' module and toollabs role [puppet] - https://gerrit.wikimedia.org/r/305804
[19:59:43] PROBLEM - Disk space on scb1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%)
[21:25:52] (PS2) Dereckson: Restrict local upload on ar.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/305573 (https://phabricator.wikimedia.org/T142450)
[21:26:36] (CR) Dereckson: "PS2: Updated upload navigation link to /wiki/ويكيبيديا:رفع" [mediawiki-config] - https://gerrit.wikimedia.org/r/305573 (https://phabricator.wikimedia.org/T142450) (owner: Dereckson)
[22:17:23] probably innocent :s
[22:44:02] (PS1) Alex Monk: letsencrypt: de-duplicate subjects in acme-setup [puppet] - https://gerrit.wikimedia.org/r/305833
[23:14:13] (PS1) Alex Monk: deployment-prep: Move poolcounter to deployment-poolcounter02 [mediawiki-config] - https://gerrit.wikimedia.org/r/305837