[00:04:36] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:07:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 72, down: 3, shutdown: 0 | Peering with AS13489 not established - The + flag cannot be used with the sub-query features described below. | Peering with AS1273 not established - CW | Peering with AS26972 not established - The + flag cannot be used with the sub-query features described below.
[00:15:57] (PS2) Awight: Remove deprecated Fundraising thermometer config [mediawiki-config] - https://gerrit.wikimedia.org/r/233900
[00:18:29] operations, Wikimedia-Mailing-lists: move sodium backup to archive pool? - https://phabricator.wikimedia.org/T113828#1676993 (Dzahn) NEW a:Dzahn
[00:20:01] operations, Wikimedia-Mailing-lists, Patch-For-Review: Decommission sodium - https://phabricator.wikimedia.org/T110142#1677004 (Dzahn) We have an existing backup of /var/lib/mailman in bacula. additionally i have /root , /etc/, /home , /usr/local and /var/log/mailman as .tar.gz files in my home on iro...
[00:25:31] (PS1) Alex Monk: Filter example ircnick from patch owners list [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/241254
[00:40:57] (PS1) Dduvall: Filter target host logging from stdout of main process [tools/scap] - https://gerrit.wikimedia.org/r/241256 (https://phabricator.wikimedia.org/T113779)
[00:42:13] (PS2) Dduvall: Filter target host logging from stdout of main process [tools/scap] - https://gerrit.wikimedia.org/r/241256 (https://phabricator.wikimedia.org/T113779)
[00:55:28] (PS3) Gergő Tisza: LDAP support [software/sentry] - https://gerrit.wikimedia.org/r/240949 (https://phabricator.wikimedia.org/T97133)
[00:55:30] (PS1) Gergő Tisza: Relocate the virtualenv to /srv/sentry [software/sentry] - https://gerrit.wikimedia.org/r/241259
[01:12:49] Ops-Access-Requests, operations, Patch-For-Review: Expand shell access for aklapper on Phabricator - https://phabricator.wikimedia.org/T113124#1677077 (yuvipanda) And on @aklapper finding time to test and verify :)
[01:40:45] (PS1) Gergő Tisza: Fixes to sentry role [puppet] - https://gerrit.wikimedia.org/r/241260
[01:52:57] (CR) Chad: [C: 2] Comment typofix [tools/scap] - https://gerrit.wikimedia.org/r/241114 (owner: Chad)
[01:53:12] (Merged) jenkins-bot: Comment typofix [tools/scap] - https://gerrit.wikimedia.org/r/241114 (owner: Chad)
[02:22:45] (PS1) coren: toolserver-legacy: exim4 needs SMTP running [puppet] - https://gerrit.wikimedia.org/r/241263 (https://phabricator.wikimedia.org/T113756)
[02:23:00] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 06m 48s)
[02:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:12] yuvipanda: ^^
[02:25:49] (CR) Yuvipanda: [C: 1] "ok but we should *really* move it to its own module..." [puppet] - https://gerrit.wikimedia.org/r/241263 (https://phabricator.wikimedia.org/T113756) (owner: coren)
[02:26:47] (PS2) coren: toolserver-legacy: exim4 needs SMTP running [puppet] - https://gerrit.wikimedia.org/r/241263 (https://phabricator.wikimedia.org/T113756)
[02:28:21] (CR) coren: [C: 2] toolserver-legacy: exim4 needs SMTP running [puppet] - https://gerrit.wikimedia.org/r/241263 (https://phabricator.wikimedia.org/T113756) (owner: coren)
[03:05:06] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:22:55] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[03:52:45] (PS2) Ori.livneh: Fixes to sentry role [puppet] - https://gerrit.wikimedia.org/r/241260 (owner: Gergő Tisza)
[03:52:53] (CR) Ori.livneh: [C: 2 V: 2] Fixes to sentry role [puppet] - https://gerrit.wikimedia.org/r/241260 (owner: Gergő Tisza)
[03:53:37] (CR) Ori.livneh: [C: 2] Relocate the virtualenv to /srv/sentry [software/sentry] - https://gerrit.wikimedia.org/r/241259 (owner: Gergő Tisza)
[04:01:47] operations: How to page when a host is down? - https://phabricator.wikimedia.org/T113834#1677153 (Andrew) NEW
[04:04:07] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:09:46] (PS2) Ori.livneh: Simplify sentry module [puppet] - https://gerrit.wikimedia.org/r/240150
[04:14:11] (CR) Ori.livneh: [C: 2] Simplify sentry module [puppet] - https://gerrit.wikimedia.org/r/240150 (owner: Ori.livneh)
[04:17:55] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0]
[04:26:46] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[05:57:36] PROBLEM - puppet last run on mw2033 is CRITICAL: CRITICAL: puppet fail
[06:26:16] RECOVERY - puppet last run on mw2033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:29:47] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 4 failures
[06:29:56] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[06:30:36] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:06] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:35] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:35] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:46] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:32:26] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:26] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:26] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:46] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:05] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:38:15] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:54:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61432 bytes in 0.421 second response time
[06:56:46] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:45] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:45] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:45] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:05] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:58:15] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:17] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:04:57] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:19:27] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[09:13:26] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1677271 (Superyetkin) Resolved>Open How about other wikis? Turkish Wikipedia does not seem to be getting [[https://tr.wiki...
[10:32:43] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1677300 (Reedy) >>! In T113721#1677271, @Superyetkin wrote: > How about other wikis? Turkish Wikipedia does not seem to be getti...
[10:38:42] mutante: Mailman queue going above 100? Heh...
[11:34:45] operations, Continuous-Integration-Config, Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1677328 (hashar) A gotcha is the puppet-lint check for boolean is only a warning. In Jenkins we have two jobs: puppetlint-lenient : solely process errors...
[12:16:53] operations, Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1677377 (Krenair) Open>Resolved >>! In T113721#1677271, @Superyetkin wrote: > How about other wikis? Turkish Wikipedia doe...
[13:59:04] operations, JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1677477 (Nemo_bis)
[14:50:37] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail
[15:04:56] operations: How to page when a host is down? - https://phabricator.wikimedia.org/T113834#1678236 (Dzahn) The `monitoring::host` class has a `contact_group` parameter, like the `monitoring::service` class does. ``` monitoring::host { $::hostname: contact_group => $contact_group ``` and this is used in...
[15:19:36] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:22:56] operations, Database: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#1678266 (jcrespo) And again, on db1069:3313: ``` Last_Error: Error 'Incorrect key file for table 'user_properties'; try to repair it' on...
[15:23:30] (PS1) Dzahn: hadoop: fix some lint issues [puppet] - https://gerrit.wikimedia.org/r/241315
[15:34:20] (PS1) Dzahn: analytics roles: some more lint fixes [puppet] - https://gerrit.wikimedia.org/r/241318
[16:11:58] operations, Labs, Database, Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1678308 (Glaisher)
[16:55:06] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail
[16:55:17] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[17:14:57] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail
[17:22:06] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:22:16] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[17:41:56] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[18:46:18] (PS1) Se4598: Restore previous AbuseFilter customs IP Block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848)
[18:50:16] (CR) Se4598: "this may have caused a regression for wikis with custom wgAbuseFilterBlockDuration because the previous (null) value was a fallback indica" [mediawiki-config] - https://gerrit.wikimedia.org/r/240053 (https://phabricator.wikimedia.org/T113164) (owner: MarcoAurelio)
[18:51:40] (PS2) Se4598: Restore previous custom AbuseFilter IP Block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848)
[19:01:49] (CR) Luke081515: [C: 1] Restore previous custom AbuseFilter IP Block durations [mediawiki-config] - https://gerrit.wikimedia.org/r/241354 (https://phabricator.wikimedia.org/T113848) (owner: Se4598)
[19:04:29] !log restarting Jenkins. Just in case :-D
[19:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:50:16] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[19:55:34] hey ori, could you review https://gerrit.wikimedia.org/r/#/c/240939/ when you have time please?
[20:24:47] PROBLEM - puppet last run on mw2026 is CRITICAL: CRITICAL: puppet fail
[20:53:37] RECOVERY - puppet last run on mw2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:06:06] (CR) Ori.livneh: [C: -1] tcpircbot: Allow per-infile channel lists (2 comments) [puppet] - https://gerrit.wikimedia.org/r/240939 (owner: Alex Monk)
[22:14:47] PROBLEM - Hadoop DataNode on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[22:42:25] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Puppet has 1 failures