[00:00:15] RoanKattouw: whee that was fast! K we'll be checking... [00:07:44] (03PS1) 10BBlack: text VCL: protect mobile cache from text pollution [puppet] - 10https://gerrit.wikimedia.org/r/258648 (https://phabricator.wikimedia.org/T109286) [00:08:13] (03CR) 10BBlack: [C: 032 V: 032] text VCL: protect mobile cache from text pollution [puppet] - 10https://gerrit.wikimedia.org/r/258648 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [00:08:49] RoanKattouw: K the new code is live, looks fine so far [00:17:40] RoanKattouw_away: all good! thanks so much again :) [00:17:48] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server [00:20:46] hm [00:20:48] that's ldap-labs.eqiad.wikimedia.org [00:21:42] and labs login just broke [00:21:52] Coren, YuviPanda, andrewbogott: ^ [00:29:20] !log restarting slapd on seaborgium with manual hack [00:29:28] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.009 seconds response time [00:32:22] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [00:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:41] paravoid: was it crashed or stuck? [00:34:12] (03PS1) 10Faidon Liambotis: openldap: add a 10-minute idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/258652 [00:34:19] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [00:34:21] andrewbogott: it had opened too many files [00:34:28] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:34:42] mutante: slapd? interesting... [00:34:42] did noone add the diamond collector yet? [00:34:54] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874796 (10JGirault) Works, I just logged in :) [00:35:31] I don’t know. Coren, YuviPanda, was one of you writing the diamond bits for ldap? [00:44:53] (03CR) 10Dzahn: [C: 04-1] "Could not find template 'maps/grants.sql.erb' at /mnt/jenkins-workspace/puppet-compiler/1477/change/src/manifests/role/maps.pp:77 on node " [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [00:47:59] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [01:11:04] andrewbogott: Coren said he almost had it done? [01:11:19] YuviPanda: yeah, that sounds familiar [01:11:22] I'll do it [01:13:31] * YuviPanda reads backscroll [01:13:37] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [01:16:25] ^ just as that fired, a user in #labs says that his session is lagging. [01:16:25] I don’t understand how that’s related, I’ve never seen that alert before [01:16:26] it's unrelated but maybe similar cause since they all hit NFS [01:16:26] which hits LDAP [01:16:26] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:16:26] so LDAP slows, NFS slows, knowk on effects [01:16:39] right, but the ldap alert isn’t firing [01:16:48] it just needs to be slow enough, I guess [01:16:56] I restarted slapd a couple of times [01:17:09] but... 
yeah, this isn't supposed to happen on just a simple restart [01:17:33] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 957096 bytes in 3.027 second response time [01:17:37] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [01:17:52] yeah, everything is back now [01:18:01] I wonder if this is nscd on labstore1001 doing something fun [01:18:44] (03PS1) 10Faidon Liambotis: openldap: add diamond collector for grabbing stats [puppet] - 10https://gerrit.wikimedia.org/r/258657 [01:19:09] (03CR) 10Faidon Liambotis: [C: 032] openldap: add a 10-minute idle timeout [puppet] - 10https://gerrit.wikimedia.org/r/258652 (owner: 10Faidon Liambotis) [01:19:40] (03CR) 10Faidon Liambotis: [C: 032] openldap: add diamond collector for grabbing stats [puppet] - 10https://gerrit.wikimedia.org/r/258657 (owner: 10Faidon Liambotis) [01:20:13] I'll leave in a bit -- if thinks get funky, I'd start by reverting the idletimeout patch [01:20:20] the monitor patch, I don't expect to have much effect [01:20:33] I’d hope not :) [01:21:46] should we make the ldap check paging? [01:21:51] I guess the tools-home check pages [01:22:06] I'll restart slapd one finale time I'm afraid [01:22:27] final* [01:25:32] https://graphite.wikimedia.org/render/?width=586&height=308&target=servers.seaborgium.openldap.conns.current [01:25:32] are you not getting that page, YuviPanda ? [01:25:32] which page, mutante [01:25:33] ldap on seaborgium? [01:25:33] tools-home [01:25:33] mutante: yes, I was talking about the direct ldap check [01:25:34] ok, you phrased that in that way that sounded like it might not work [01:25:35] and we were wondering about the gateway [01:27:18] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [01:28:11] why does that fail so often? [01:28:45] I think it just hits LDAP [01:28:50] it's been a while let me look [01:29:07] does it crash when LDAP disconnects? [01:29:08] it shouldn't [01:29:10] Dec 12 01:22:43 labstore1001 create-dbusers[12939]: ldap3.core.exceptions.LDAPSessionTerminatedByServer: session terminated by server [01:29:11] it should reconnect [01:29:12] yeah [01:29:16] let me fix that [01:29:21] <3 [01:30:27] (03PS4) 10Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 [01:31:27] paravoid: it's also holding one connection open forever. let me make it close it at the end of each run, and also space out the runs more [01:32:01] (03CR) 10Dzahn: "@akosiaris the templates were in "./manifests/templates/" inside the role module. it did not find them there.. moved them to ./templates/m" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:32:03] that's not necessarily a problem [01:32:23] in fact it will probably save some resources [01:32:34] for frequent readers/writes keeping the connection open is a good idea [01:32:39] (03CR) 10Alex Monk: "Added some missing newlines" [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [01:32:47] saves you from establishing a new connection every time (TCP, TLS negotiation etc.) 
[01:33:15] otoh an open connection does consume finite resources, so if you're only doing a query every hour, it's probably better to connect/query/close [01:33:20] it's a tradeoff essentially [01:34:52] (03PS1) 10Yuvipanda: labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 [01:35:08] (03CR) 10Dzahn: "Error: Role class role::maps not found at /mnt/jenkins-workspace/puppet-compiler/1478/change/src/manifests/site.pp:1746 on node maps-test2" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:35:10] paravoid: hmm, so ^ [01:36:28] (03PS5) 10Dzahn: (WIP) maps: move roles into autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/249059 [01:36:34] paravoid: let me know what you think of that patch. I can also make it hit LDAP every 5 mins instead of 2min [01:36:36] nbd [01:38:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874852 (10Dzahn) >>! In T118998#1874796, @JGirault wrote: > Works, I just logged in :)... [01:38:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1874855 (10Dzahn) 5Open>3Resolved [01:38:59] 10Ops-Access-Requests, 6operations: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1815162 (10Dzahn) [01:39:27] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874859 (10Dzahn) a:5ArielGlenn>3Dzahn [01:41:59] 6operations, 6Labs, 10netops, 5Patch-For-Review: Create labs baremetal subnet? - https://phabricator.wikimedia.org/T121237#1874862 (10Dzahn) [01:45:42] (03CR) 10Dzahn: [C: 04-2] "http://puppet-compiler.wmflabs.org/1479/" [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [01:47:07] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [02:01:32] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874887 (10Dzahn) @Lydia_Pintscher i checked our mail logs, and i can confirm there are 4 emails that have: - been sent to lydia.pintscher@wikimedia.org - on December 10 - origin is wikidata wiki... [02:08:35] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874898 (10Dzahn) @Hoo i can't confirm an issue on our side. i also see mail delivered to you on that day, but @wikimedia.de and both are H=mxlb.ispgateway.de which other account (gmail vs. 1and1)... [02:13:28] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874912 (10Dzahn) >>! In T119404#1825404, @Krenair wrote: > statistics-users is not necessary, researchers is the right grou... 
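An aside on the create-dbusers thread above (not part of the log): the fix YuviPanda describes — open a fresh LDAP connection per run, and treat a server-side hangup as a signal to reconnect rather than crash — would look roughly like the sketch below with the ldap3 library. The server URI, base DN and query here are placeholders, not the actual labstore1001 script.

```python
# Hypothetical sketch only -- not the real create-dbusers code. The idea is to
# open a fresh connection per run (so the new 10-minute idle timeout or a slapd
# restart can't leave a dead session around) and to retry once on a hangup
# instead of letting the whole unit fail.
import ldap3
from ldap3.core.exceptions import LDAPSessionTerminatedByServer, LDAPSocketOpenError

LDAP_URI = "ldap://ldap-labs.eqiad.wikimedia.org"  # placeholder
BASE_DN = "ou=people,dc=wikimedia,dc=org"          # placeholder

def fetch_accounts():
    conn = ldap3.Connection(ldap3.Server(LDAP_URI), auto_bind=True)
    try:
        conn.search(BASE_DN, "(objectClass=posixAccount)",
                    attributes=["uid", "uidNumber"])
        return [e["attributes"] for e in conn.response if "attributes" in e]
    finally:
        conn.unbind()  # always close; the next run starts from scratch

def run_once():
    try:
        return fetch_accounts()
    except (LDAPSessionTerminatedByServer, LDAPSocketOpenError):
        # "session terminated by server" is what showed up in the labstore1001
        # journal above; reconnect once with a brand-new connection.
        return fetch_accounts()
```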
[02:16:12] (03PS3) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:17:32] (03CR) 10jenkins-bot: [V: 04-1] add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:17:57] (03PS4) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:18:37] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874918 (10Krenair) researchers gives access to both stat1003 itself and the mysql credentials. I have the same thing myself... [02:19:38] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874919 (10GWicke) [02:20:05] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001281, end_log_pos 721053864 [02:20:44] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [02:21:04] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874921 (10Dzahn) Thanks. Then the researchers group has been changed at some point to include stat1003. Moving and and merg... [02:21:51] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 08m 31s) [02:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:23:57] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874922 (10GWicke) [02:28:25] (03PS5) 10Dzahn: add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:28:44] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 12 02:28:44 UTC 2015 (duration 6m 53s) [02:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:59] (03CR) 10Dzahn: [C: 032] add matmarex to researchers [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 10ArielGlenn) [02:39:01] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874943 (10Dzahn) 5Open>3Resolved [stat1003:~] $ id matmarex uid=2501(matmarex) gid=500(wikidev) groups=500(wikidev),714... 
[02:39:17] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874945 (10Dzahn) [02:41:31] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874947 (10BBlack) The task is vague, can you give specific example scenarios or something? I'm assuming services would run as regular users and not have root... [02:51:51] 10Ops-Access-Requests, 6operations, 6Multimedia: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1874950 (10Dzahn) @matmarex your user exists on stat1003.eqiad.wmnet now the mysql credentials are here, i confirmed i can read them as your use... [02:56:09] (03CR) 10Andrew Bogott: [C: 031] labstore: Do not re-use connections for create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/258658 (owner: 10Yuvipanda) [02:58:29] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: puppet fail [03:02:15] !log ran fixDefaultJsonContentPages.php --wiki=thwiktionary for T108663 [03:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:02:35] (03PS1) 10coren: Reorder modules in common-account [puppet] - 10https://gerrit.wikimedia.org/r/258663 [03:02:48] YuviPanda: andrewbogott: That one && [03:03:22] (03CR) 10Yuvipanda: [C: 031] Reorder modules in common-account [puppet] - 10https://gerrit.wikimedia.org/r/258663 (owner: 10coren) [03:04:33] (03CR) 10coren: [C: 032] "Simple fix." [puppet] - 10https://gerrit.wikimedia.org/r/258663 (owner: 10coren) [03:06:04] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874974 (10GWicke) > The task is vague, can you give specific example scenarios or something? Anybody controlling one of these services has at least code execu... [03:15:27] g’night all! [03:16:12] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1874977 (10hoo) >>! In T121105#1874898, @Dzahn wrote: > @Hoo i can't confirm an issue on our side. i also see mail delivered to you on that day, but @wikimedia.de and both are H=mxlb.ispgateway.de... [03:21:37] good night andrewbogott [03:26:07] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:26:52] I'm getting intermittent 429's... [03:30:15] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874987 (10Krenair) Do we actually care whether a service is developed by volunteers or not? [03:31:42] Josve05a_night, are you sending lots of requests? [03:33:06] not that I know of [03:33:18] seems to be good now though... [03:37:19] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 2 failures [03:39:02] (03PS4) 10Madhuvishy: [WIP] apache: Add role to serve static sites on multiple hosts using apache [puppet] - 10https://gerrit.wikimedia.org/r/258096 [03:54:01] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874992 (10GWicke) > Do we actually care whether a service is developed by volunteers or not? @Krenair: I agree with you that expertise and responsiveness of m... 
[04:12:19] PROBLEM - puppet last run on mw2078 is CRITICAL: CRITICAL: Puppet has 1 failures [04:37:58] RECOVERY - puppet last run on mw2078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:40:40] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1875004 (10JanZerebecki) >>! In T111654#1844080, @jcrespo wrote: > Of all ciphers, only a few work: Nearly all of the not working ones are EC based. Either needing an ECDSA key (you likely d... [05:04:18] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail [05:12:19] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [05:26:38] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [05:29:39] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [05:32:08] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: puppet fail [05:37:32] (03PS1) 10Revi: Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [05:37:57] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:52:09] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:57:37] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:08:59] (03CR) 10Glaisher: Noindex for User namespace in kowiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [06:11:29] (03PS2) 10Revi: Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [06:19:57] PROBLEM - Disk space on restbase1004 is CRITICAL: DISK CRITICAL - free space: /var 105668 MB (3% inode=99%) [06:20:10] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001281, end_log_pos 721053864 [06:21:32] (03CR) 10Glaisher: [C: 031] Noindex for User namespace in kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [06:21:57] RECOVERY - Disk space on restbase1004 is OK: DISK OK [06:25:29] PROBLEM - Disk space on elastic1008 is CRITICAL: DISK CRITICAL - free space: / 630 MB (2% inode=95%) [06:27:18] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:27:28] RECOVERY - Disk space on elastic1008 is OK: DISK OK [06:30:28] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:30:28] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:47] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:30:58] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:08] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:18] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [06:31:28] PROBLEM - 
puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:38] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures [06:56:28] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:48] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:57:08] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:28] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:57:58] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:07] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:08] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:58:38] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:36:39] PROBLEM - puppet last run on elastic1002 is CRITICAL: CRITICAL: Puppet has 1 failures [08:02:17] RECOVERY - puppet last run on elastic1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:15:26] (03PS1) 10Dereckson: Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) [08:22:39] (03PS3) 10Dereckson: Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:22:47] (03CR) 10Dereckson: [C: 031] Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:23:46] (03PS4) 10Revi: Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) [08:24:06] (03CR) 10Revi: "Fixes ~~~~ is to trigger autoclose of phabricator." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [08:24:27] duh, pingspam [08:25:20] and T121301 is (strictly speaking) unrelated to T92798 [08:25:29] completely different req [08:27:04] revi: the two are about configure namespaces [08:27:34] yes but you don't add see also to all previous related bugs [08:27:59] revi: it allows to track the similar requests (any namespace-related confg change for a wiki), like on the "see also" field on Bugzilla. [08:28:33] If you want to track similar req, I think project or tracking bug. [08:29:17] revi: We try to avoid to sue tracking bugs on Phabricator. A project namespaces-for-kowiki seems a little expensive to link two or three requests. [08:29:21] I wanted to set tracking bug for kowiki but I didn't know how to make it in bz and after phab... I'm waiting for some bug (I forgot teh no.) about policy for per-wiki/per-language project [08:29:41] well [08:29:59] I have at least 3 or 5 for ns stuff of kowiki [08:30:02] (bugs) [08:30:03] Oh okay. Some other local community created a tracking bug for this use, and add any new bug as blocing the former [08:30:22] see project #commons, #wikisource [08:30:29] for example [08:31:04] I imagine the project is a good idea when a project board is needed. [08:31:23] yeah but that policy is under discussion for a year and more [08:33:15] and to counterexample of '17:29:17 revi: We try to avoid to sue tracking bugs on Phabricator. A project namespaces-for-kowiki seems a little expensive to link two or three requests.' T57342, T57914, T87528 [08:33:27] and more kowiki requests... anyway. [08:34:37] (03PS1) 10Dereckson: Set site name on sr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258670 (https://phabricator.wikimedia.org/T121278) [08:39:42] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1875098 (10tomasz) Removing myself and adding the other Tomasz instead :-) [09:01:59] !log move old elasticsearch logs on elastic1012 out to /var/lib/elasticsearch/log (/ is full) [09:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:08] RECOVERY - Disk space on elastic1012 is OK: DISK OK [09:06:51] !log move old elasticsearch logs on elastic1016 out to /var/lib/elasticsearch/log (/ is full) [09:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:07:35] mutante|away: you always use ganeti01.svc.$::site.wmnet [09:07:50] it will always point you to the correct master [09:08:16] (03PS1) 10Dereckson: Enable Wikilove on az.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) [09:08:41] (03CR) 10Dereckson: [C: 04-1] "Blocked on community consensus." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258671 (https://phabricator.wikimedia.org/T119727) (owner: 10Dereckson) [09:08:49] RECOVERY - Disk space on elastic1016 is OK: DISK OK [09:12:54] (03PS1) 10Dereckson: Enable NewUserMessage on ps.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258672 (https://phabricator.wikimedia.org/T121132) [09:13:42] PROBLEM - MariaDB Slave Lag: s1 on db2016 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 24953 [09:14:12] RECOVERY - MariaDB Slave SQL: s1 on db2016 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:14:16] heh [09:14:22] wat ? [09:14:33] how did it get fixed ? 
[09:14:38] I fixed it [09:14:43] <_joe_> akosiaris: it's a large query [09:15:01] !log move old elasticsearch logs on elastic1026 out to /var/lib/elasticsearch/log (/ is full) [09:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:08] <_joe_> the "seconds behind master" is a calculation made by mysql, it's not a real number [09:15:14] _joe_: I don't think so... it said slave sql runing: no [09:15:19] <_joe_> godog: we should make logrotate more aggressive? [09:15:20] no, it is something writing on the ops database, that broke replication [09:15:25] <_joe_> slave_sql_lag Seconds_Behind_Master: 24953 [09:15:27] _joe_: that ^ [09:15:35] the lag is real [09:15:50] <_joe_> jynus: the ops database? [09:15:54] _joe_: it doesn't mean it will take 24953 seconds to catch up [09:16:02] <_joe_> yup exactly [09:16:07] _joe_, you know the same I do [09:16:09] :-) [09:16:09] _joe_: not sure yet what's the right solution, it is the current indexing slowlog that's big [09:16:13] just that since the last time it successully replayed something 24953 sec have passed [09:16:21] <_joe_> godog: ah, known problem [09:16:39] RECOVERY - Disk space on elastic1026 is OK: DISK OK [09:16:39] jynus: ok, so I assume some weird query ? [09:16:41] <_joe_> godog: I think david had a ticket tracking the upstream issue [09:16:43] yes, it should take less than that, that is the current delay [09:16:50] akosiaris, I have to investigate an fix it [09:17:21] ok. what was the temporary fix ? sql slave skip counter =1 ? [09:17:28] <_joe_> it depends [09:17:30] in this case [09:17:45] where there was a write to a non existent table on the slave [09:17:52] lol [09:17:53] and a table that is not useful [09:18:03] yes, skip was ok [09:18:10] normally that is not ok [09:18:10] good to know it was what I would have done [09:18:19] do not do it generally [09:18:21] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 23745 [09:18:27] that breaks replication even more [09:18:27] oh I know better than that [09:18:39] <_joe_> jynus: how do you know that table is not useful? [09:18:40] I usually look very carefully at the query before doing that [09:18:53] so, db2034 is the exact same thing I assume ? [09:19:09] I'm ackin all those [09:19:26] when it broke on db2016, it broke on all of codfw [09:19:43] _joe_, the ops database is not real data, AFAIK [09:19:53] <_joe_> lol :) [09:20:01] it is supposed to be a local database in the past for query profiling [09:20:16] the fact that it is being replicated is a flaw in the logic [09:20:22] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:20:30] _joe_: heh, do you know offhand how to search (ah ah) for it? [09:20:31] and I have to found why it was created [09:22:43] <_joe_> godog: nope [09:22:46] <_joe_> godog: I can try [09:24:51] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is failing due to the role:: namespace being handled both by the module and the import "manifests/role/*.pp" statement if this patch " [puppet] - 10https://gerrit.wikimedia.org/r/249059 (owner: 10Dzahn) [09:25:15] <_joe_> godog: https://phabricator.wikimedia.org/T117181 [09:25:41] this is an important bug, that would have broken the wikis, if it had been eqiad [09:26:42] <_joe_> akosiaris: oh so the original phab issue on this is wrong? [09:27:16] _joe_: sorry ? not following ... [09:27:47] <_joe_> the Ps you just commented on [09:27:57] there's a phab for that ? 
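For context on the indexing slowlog filling / on the elastic hosts (not part of the log): the workaround godog !logs a little further down — disabling the TRACE indexing slowlog for urwiki_content — is a dynamic per-index settings change. A hypothetical sketch of that kind of call in Python with requests; the setting name is taken from the Elasticsearch documentation of that era and is an assumption to verify against the cluster's actual version.

```python
# Hypothetical illustration only: raise the TRACE indexing-slowlog threshold to
# -1 (i.e. disable it) for a single index via the dynamic _settings API.
import json
import requests

ES = "http://localhost:9200"   # placeholder: any node in the eqiad cluster
INDEX = "urwiki_content"

resp = requests.put(
    "{}/{}/_settings".format(ES, INDEX),
    headers={"Content-Type": "application/json"},
    data=json.dumps({"index.indexing.slowlog.threshold.index.trace": "-1"}),
)
resp.raise_for_status()
```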
[09:28:39] <_joe_> yup, gimme a sec [09:28:46] _joe_: ah thanks! I'll piggyback on that [09:31:48] 6operations, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#1875146 (10fgiunchedi) [09:32:58] so, Connection: Close doesn't seem to imply anything to etherpad ... [09:33:43] 6operations, 6Commons, 10Wikimedia-Media-storage: image magick stripping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#1875147 (10Nemo_bis) [09:36:52] PROBLEM - MariaDB Slave Lag: s1 on db2034 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17328 [09:36:56] PROBLEM - MariaDB Slave Lag: s1 on db2055 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17328 [09:37:42] PROBLEM - MariaDB Slave Lag: s1 on db2069 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 17007 [09:40:55] (03CR) 10Aklapper: ""not ready for merge yet, a bug to be worked out" - if that is still true, feel encouraged to add a [WIP] prefix to the patch summary so t" [dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [09:42:39] PROBLEM - puppet last run on mw2149 is CRITICAL: CRITICAL: puppet fail [09:43:08] hi [09:43:10] more LDAP woes [09:44:03] PROBLEM - MariaDB Slave Lag: s1 on db2048 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 14564 [09:44:39] !log recreated events on db1057 with sql_bin_log = 0 and restarted replication on db2016 [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:45:22] RECOVERY - MariaDB Slave Lag: s1 on db2034 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:45:26] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:45:28] so here is the thing [09:46:12] RECOVERY - MariaDB Slave Lag: s1 on db2069 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:46:21] RECOVERY - MariaDB Slave Lag: s1 on db2048 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [09:46:38] events, that may be actually useful, but I would prefer to have on an agent and puppetized will kill long-running queries [09:46:47] and idling connections [09:46:58] that creates a log to a table [09:47:03] and that table is purged [09:47:09] but that was replicated [09:47:49] meaning that at some point table was purged, but that made replication explode [09:48:12] I recereated the events so that they do not write to the slave's binary log [09:48:27] but all servers should be checked for these events [09:48:48] (03CR) 10Filippo Giunchedi: "LGTM, do you have this running already somewhere to peek at a sample of metrics?" 
[puppet] - 10https://gerrit.wikimedia.org/r/258491 (owner: 10Alexandros Kosiaris) [09:49:42] PROBLEM - MariaDB Slave SQL: s1 on db2016 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table ops.event_log: Cant find record in event_log, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1057-bin.001282, end_log_pos 300174534 [09:50:32] I wil have to skip, however, errors like ^ created hours ago [09:51:07] do not skip errors on tables outside of the ops database, which is not very useful [09:52:56] better that we caught this now than before it was on full production [10:00:59] !log elastic in eqiad: disabling TRACE indexing slowlog for urwiki_content [10:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:02:22] RECOVERY - MariaDB Slave SQL: s1 on db2016 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:10:08] RECOVERY - puppet last run on mw2149 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:12:35] !log bounce nslcd on tools-submit and stop puppet [10:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:15:00] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219845 (owner: 10ArielGlenn) [10:15:05] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219852 (owner: 10ArielGlenn) [10:15:10] (03CR) 10Aklapper: "@ArielGlenn: This patch has seen no activity for six months. How to get this reviewed/merged or some progress here so this won't just rot?" [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219841 (https://phabricator.wikimedia.org/T103013) (owner: 10ArielGlenn) [10:32:59] 6operations, 3Discovery-Cirrus-Sprint, 5Patch-For-Review: Elasticsearch index indexing slow log generates too much data - https://phabricator.wikimedia.org/T117181#1875217 (10dcausse) Thanks, Joe submitted a puppet patch few month ago but we haven't restarted the nodes yet. The workaround today is to disable... [10:47:34] 6operations, 10MediaWiki-General-or-Unknown, 10MediaWiki-Logging: Error: 2013 Lost connection to MySQL server during query on IndexPager::buildQueryInfo (LogPager) - https://phabricator.wikimedia.org/T121306#1875234 (10Josve05a) [10:48:42] RECOVERY - MariaDB Slave Lag: s1 on db2016 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [10:49:47] that is the end of the issue, it will take hours to confirm that it has been fixed permanently [10:51:10] thanks [10:52:48] these pages are a bit too verbose [10:53:11] a page for each codfw slave that's lagging behind is probably too much :) [10:57:26] well, it is not trivial- if the master lags, all lag, but they can lag individually to [10:57:35] problem is that topology is dynamic [10:57:42] and not puppet-dependent [10:57:57] so it is not trivial to create nagios dependencies [10:59:09] I suppose a check could be created where not become critical if its master is critical, but again that is not dynamic [11:00:04] BTW, I have not received a single SMS [11:00:11] I got them all.. [11:00:44] I woke up like 4 times :P [11:01:09] paravoid sleeps? 
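A note on the replication thread above (not part of the log): the distinction _joe_ and jynus draw — Slave_SQL_Running tells you whether replication is actually broken, while Seconds_Behind_Master is only the age of the last applied event, not a catch-up estimate — is exactly what the Icinga checks read out of SHOW SLAVE STATUS. A minimal, hypothetical sketch with PyMySQL; host, user and thresholds are placeholders.

```python
# Hypothetical sketch of what a slave-health check inspects. Skipping events
# (sql_slave_skip_counter) is deliberately NOT automated here: as noted above,
# it is only safe when the failing statement is known to be harmless.
import pymysql

def slave_status(host):
    conn = pymysql.connect(host=host, user="check", password="...",  # placeholders
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone()
    finally:
        conn.close()

status = slave_status("db2016.codfw.wmnet")
if status["Slave_SQL_Running"] != "Yes":
    print("CRITICAL:", status["Last_SQL_Error"])        # replication broken
elif (status["Seconds_Behind_Master"] or 0) > 300:
    # age of the last applied event, not time needed to catch up
    print("WARNING: lag", status["Seconds_Behind_Master"], "s")
```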
[11:01:13] TIL [11:03:36] ldap connections are holding steady, btw: https://graphite.wikimedia.org/render/?width=586&height=308&target=servers.serpens.openldap.conns.current&target=servers.seaborgium.openldap.conns.current [11:05:21] paravoid: lots of errors from nslcd in syslogs tho [11:16:56] I think the best way is to not page if a server is depooled [11:17:30] but that requires to pull from git on every check [11:19:31] but it would be nice to integrate it with systemd - "this server is pooled, are you sure you want to stop it?" [13:26:37] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: puppet fail [13:52:09] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:57:35] (03CR) 10Luke081515: [C: 031] Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [13:58:18] Can a member of the SWAT Team look for that patch? He needs deployment till tomorow [14:05:45] SWAT on that level isn't really ops territory :) [14:44:48] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail [14:57:16] ori: You have done a change in syntax highlight this month: https://gerrit.wikimedia.org/r/#/c/256577/ [14:57:45] After setting up a wiki with this change, I get the error „Class undefined: Symfony\Component\Process\ProcessBuilder” [15:12:57] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:18:03] Run composer [15:22:10] Run, Composer, run! [16:00:35] (03PS3) 10Alexandros Kosiaris: Remove empty manifests/role/openldap.pp [puppet] - 10https://gerrit.wikimedia.org/r/258486 [16:00:57] (03CR) 10Alexandros Kosiaris: [V: 032] Remove empty manifests/role/openldap.pp [puppet] - 10https://gerrit.wikimedia.org/r/258486 (owner: 10Alexandros Kosiaris) [16:11:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:15:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:23:43] (03PS2) 10Alexandros Kosiaris: varnish: allow etherpad to use websockets [puppet] - 10https://gerrit.wikimedia.org/r/258439 [16:53:39] PROBLEM - puppet last run on db2001 is CRITICAL: CRITICAL: puppet fail [16:59:34] (03CR) 10Alex Monk: [C: 032] Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [16:59:57] (03Merged) 10jenkins-bot: Throttle rule for Hyderabad photo event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258669 (https://phabricator.wikimedia.org/T121303) (owner: 10Dereckson) [17:01:13] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/258669/ (duration: 00m 29s) [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:19] RECOVERY - puppet last run on db2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:20:25] (03CR) 10Florianschmidtwelzow: [C: 031] Don't index User namespace on ko.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258667 (https://phabricator.wikimedia.org/T121301) (owner: 10Revi) [17:56:38] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail [18:10:48] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [18:24:17] RECOVERY - 
puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:38:28] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:11:48] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [19:30:38] !log Restarted Zuul [19:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:28] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:20] (03PS1) 10Ori.livneh: Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) [20:12:52] (03PS2) 10Ori.livneh: Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) [20:13:00] (03CR) 10Ori.livneh: [C: 032 V: 032] Mark rdb1006 as a slave of rdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/258691 (https://phabricator.wikimedia.org/T119543) (owner: 10Ori.livneh) [20:19:44] (03PS1) 10Ori.livneh: Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 [20:19:58] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail [20:21:04] ori I like your puppet magic: <| |> [20:21:35] (03CR) 10jenkins-bot: [V: 04-1] Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 (owner: 10Ori.livneh) [20:22:34] i'm actually a bit mystified [20:23:41] that diamond operator is probably way over my league [20:23:55] I will get another glass of Bourgogne instead [20:24:10] it's not a diamond, it's a spaceship! :) [20:24:11] https://docs.puppetlabs.com/puppet/latest/reference/lang_collectors.html [20:24:16] but that sounds like a better plan [20:24:49] (03PS2) 10Ori.livneh: Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 [20:25:25] I wonder whether that would work on labs [20:25:32] I don't think puppet has collection enabled on labs [20:25:49] maybe puppetmaster::self does though [20:25:57] (03CR) 10Ori.livneh: [C: 032] Ensure /srv/redis is created before starting redis [puppet] - 10https://gerrit.wikimedia.org/r/258692 (owner: 10Ori.livneh) [20:27:15] i don't think it's possible to disable resource collectors [20:27:55] at least deployment-redis{01,02} are passing puppet [20:28:21] yeah i'm just checking on deployment-redis02 [20:30:28] * hashar refills with more Bourgogne [20:33:48] (03PS1) 10Ori.livneh: redis: explicitly declare 'daemonize no' for each instance [puppet] - 10https://gerrit.wikimedia.org/r/258693 [20:35:44] (03CR) 10Ori.livneh: [C: 032] redis: explicitly declare 'daemonize no' for each instance [puppet] - 10https://gerrit.wikimedia.org/r/258693 (owner: 10Ori.livneh) [20:43:08] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: http status 500 [20:43:18] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: http status 500 [20:43:54] doh [20:44:05] (03PS1) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [20:44:10] ori: could it have impacted ocg ? 
[20:44:44] i'll check [20:45:08] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 147 msg: ocg_render_job_queue 0 msg [20:45:17] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 170 msg: ocg_render_job_queue 0 msg [20:45:50] it looks like ocg crashes if its redis server is unavailable even briefly [20:45:56] it then recovers [20:45:58] still pretty shitty [20:46:09] anyways, fine now, didn't need to intervene [20:46:24] we had a spike of errors [20:46:43] even on wikis with stuff like "Could not insert 1 enqueue job(s)." [20:46:46] (03PS1) 10Ori.livneh: Add rdb100[5-6] to job runner configuration [puppet] - 10https://gerrit.wikimedia.org/r/258696 (https://phabricator.wikimedia.org/T119543) [20:46:49] guess MediaWiki doesn't like it either [20:47:38] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:47:54] * ori doesn't see a spike on https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [20:48:49] looks like it was on jobs [20:48:53] (03CR) 10Ori.livneh: [C: 032] Add rdb100[5-6] to job runner configuration [puppet] - 10https://gerrit.wikimedia.org/r/258696 (https://phabricator.wikimedia.org/T119543) (owner: 10Ori.livneh) [20:48:57] guess they will just be retried later on [20:50:24] (03PS1) 10Ori.livneh: Fix-up for I84fe3a2638 [puppet] - 10https://gerrit.wikimedia.org/r/258697 [20:50:35] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I84fe3a2638 [puppet] - 10https://gerrit.wikimedia.org/r/258697 (owner: 10Ori.livneh) [20:52:51] (03PS2) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [20:54:33] (03PS3) 10Ori.livneh: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 [21:00:06] (03CR) 10Ori.livneh: [C: 032] Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 (owner: 10Ori.livneh) [21:00:39] (03Merged) 10jenkins-bot: Add rdb100[5-6] to job queue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258695 (owner: 10Ori.livneh) [21:02:57] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I53f13a159: Add rdb100[5-6] to job queue configuration (duration: 00m 31s) [21:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:44] 6operations, 10ops-eqiad, 5Patch-For-Review: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1875793 (10ori) 5Open>3Resolved [21:06:29] have a good afternoon! [21:09:28] Hi [21:09:45] Purging isnt working - https://en.wikisource.org/w/index.php?title=Index:A_Voyage_in_Space_%281913%29.djvu&redirects=1 [21:10:02] And there a pages shown in red that should be in Yellow [21:10:08] I'm starting to get this after various edits and admin actions: [6067fa51] 2015-12-12 21:07:35: Fatal exception of type "MWException" [21:10:09] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Puppet has 2 failures [21:10:24] which suggests page status and link table aren't updating. 
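On the OCG blip above (not part of the log): a client that returns 500s the moment Redis restarts, or while Redis is still "LOADING ... the dataset in memory", can usually be hardened with a small retry wrapper. A hypothetical sketch with the Python redis client; the host and the wrapped call are placeholders, not OCG's actual code.

```python
# Hypothetical retry wrapper: ride out a brief Redis restart (connection
# refused) or the post-restart "LOADING Redis is loading the dataset in memory"
# window instead of failing the request outright.
import time
import redis
from redis.exceptions import BusyLoadingError, ConnectionError as RedisConnectionError

client = redis.StrictRedis(host="rdb1005.eqiad.wmnet", port=6379)  # placeholder

def with_retries(fn, attempts=5, delay=2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except (RedisConnectionError, BusyLoadingError):
            if attempt == attempts - 1:
                raise          # out of retries: surface the error
            time.sleep(delay)  # brief back-off, then try again

# e.g. enqueueing work survives a short blip of the queue server:
with_retries(lambda: client.lpush("render-jobs", "some-job-id"))
```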
[21:10:58] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: Puppet has 1 failures [21:10:58] Was something updated recently, or perhpas "something may have gone wrong" [21:11:58] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 970 bytes in 0.061 second response time [21:12:17] should sort itself out in a moment [21:14:47] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Puppet has 3 failures [21:17:49] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1470 bytes in 0.325 second response time [21:17:58] PROBLEM - puppet last run on mc2012 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:19] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:19] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:18:27] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: puppet fail [21:18:57] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: puppet fail [21:20:08] PROBLEM - puppet last run on mc2016 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:27] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Puppet has 4 failures [21:22:25] (03PS1) 10Ori.livneh: Fix-up for Ibc46006e17: daemonize [puppet] - 10https://gerrit.wikimedia.org/r/258703 [21:22:28] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet has 1 failures [21:22:38] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Ibc46006e17: daemonize [puppet] - 10https://gerrit.wikimedia.org/r/258703 (owner: 10Ori.livneh) [21:23:07] hashar: I was told using resource collectors on labs is a security risk... never checked if and to what extend that is true. [21:23:24] ori: we also have 2000 errors per seconds related to rdb1007 such as "Lua script error on server "rdb1007.eqiad.wmnet:6379": LOADING Redis is loading the dataset in memory" [21:23:46] yeah, that was when it was restarted. it should be ok now [21:23:46] jzerebecki: i think the issue is that the resources from your project are sent to the labs wide puppetmaster [21:23:58] jzerebecki: thus your project resources could be reached by other projects [21:25:10] ori: comes in spike every 1 minutes and 30 seconds [21:25:24] you're confusing https://docs.puppetlabs.com/puppet/latest/reference/lang_exported.html with resource collectors [21:25:32] on all redis instances of rdb1007 [21:25:43] oh [21:26:00] !log running fixDefaultJsonContentPages.php on all wikis (T108663) [21:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:10] they both have <<| |>> [21:26:49] no, one is <| |> [21:27:38] ah yes I confused those [21:27:59] *sigh* wikidata dispatching broke [21:28:24] maybe due to job runners redis? [21:28:53] yea [21:29:08] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:29:15] anyway. Gotta sleep [21:29:21] and recover from the week [21:29:33] have a good weekend [21:29:43] jzerebecki: I will get the Jenkins tmpfs issue sorted out eventually :( [21:30:10] jzerebecki: pretty sure it is the mwext-selenium-mw job being the cause because of SKIP_TMPFS. Will try to reproduce on Monday [21:30:16] have a good week-end! 
[21:30:18] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:30:23] hashar: maybe my patch would fix that [21:30:30] yeha [21:31:08] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 1 failures [21:31:57] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:18] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [21:32:25] jzerebecki: i will probably get rid of the skip tmpfs entirely . We will see [21:34:19] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: Puppet has 4 failures [21:36:28] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: Puppet has 1 failures [21:36:48] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:37:08] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:40:09] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:17] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 3 failures [21:40:27] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:40:39] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:40:48] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:41:57] RECOVERY - puppet last run on mc2016 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:43:38] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures [21:44:38] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: puppet fail [21:44:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[21:44:57] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [21:44:58] PROBLEM - puppet last run on rdb2001 is CRITICAL: CRITICAL: Puppet has 3 failures [21:45:37] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:45:38] RECOVERY - puppet last run on mc2012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:45:39] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:45:59] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:45:59] PROBLEM - puppet last run on mc2009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:46:17] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:17] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: Puppet has 1 failures [21:46:18] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:29] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:58] PROBLEM - puppet last run on mc2003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:47:58] RECOVERY - puppet last run on mc2009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [21:47:59] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:48:09] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [21:48:58] RECOVERY - puppet last run on mc2003 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [21:49:37] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:59] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:49:59] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:50:57] RECOVERY - puppet last run on rdb2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:55:31] (03PS1) 10Ori.livneh: Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 [21:55:56] (03CR) 10Ori.livneh: [C: 032] Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 (owner: 10Ori.livneh) [21:56:26] (03Merged) 10jenkins-bot: Use correct partition names for rdb4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258709 (owner: 10Ori.livneh) [21:57:59] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:58:04] (03PS1) 10Aaron Schulz: Fixed duplicate rdb3 keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258710 [21:58:47] (03Abandoned) 10Ori.livneh: Fixed duplicate rdb3 keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258710 (owner: 10Aaron Schulz) [22:01:14] (03CR) 10Ori.livneh: "the 'ń' broke puppet on rutherfordium: Error: Could not convert change 'comment' to string: incompatible character encodings: UTF-8 and AS" [puppet] - 10https://gerrit.wikimedia.org/r/256000 (https://phabricator.wikimedia.org/T119404) (owner: 
10ArielGlenn) [22:03:48] (03PS1) 10Ori.livneh: ASCII-fi name in admin/data/data.tml [puppet] - 10https://gerrit.wikimedia.org/r/258713 [22:04:08] (03PS2) 10Ori.livneh: ASCII-fi name in admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/258713 [22:04:37] (03CR) 10Ori.livneh: [C: 032 V: 032] ASCII-fi name in admin/data/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/258713 (owner: 10Ori.livneh) [22:06:38] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [22:12:56] (03PS1) 10Ori.livneh: Use rdb1005 as the primary job queue aggregator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258714 [22:31:37] PROBLEM - cassandra CQL 10.64.0.223:9042 on restbase1007 is CRITICAL: Connection refused [23:04:32] (03CR) 10Alex Monk: "Isn't the correct fix to make sure it expects utf-8 instead?" [puppet] - 10https://gerrit.wikimedia.org/r/258713 (owner: 10Ori.livneh) [23:28:58] (03PS1) 10Yuvipanda: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 [23:29:56] (03PS2) 10Ori.livneh: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:30:50] (03CR) 10Ori.livneh: [C: 04-1] "This will break, because init.pp includes some collectors" [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:35:58] (03PS3) 10Ori.livneh: diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:38:32] ori: thanks :D [23:42:04] (03CR) 10Ori.livneh: [C: 031] "catalog compiler confirms it's a no-op on already-provisioned hosts (https://puppet-compiler.wmflabs.org/1485/)" [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda) [23:58:32] (03CR) 10Yuvipanda: [C: 032] diamond: Fix puppet failures on first run [puppet] - 10https://gerrit.wikimedia.org/r/258719 (owner: 10Yuvipanda)
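Since both ends of this log touch Diamond — the openldap stats collector merged at 01:19, whose output is the servers.*.openldap.conns.current graphite series above, and the first-run fix merged here — a minimal, hypothetical sketch of the shape of a Diamond collector follows. It is not the actual collector from the puppet repo; the metric name and stub value are placeholders.

```python
# Hypothetical minimal Diamond collector: subclass Collector, implement
# collect(), and publish numeric values. A real openldap collector would query
# the server (e.g. cn=monitor) instead of returning a stub.
import diamond.collector


class ExampleOpenLDAPCollector(diamond.collector.Collector):

    def get_default_config(self):
        config = super(ExampleOpenLDAPCollector, self).get_default_config()
        config.update({
            'path': 'openldap',  # metrics land under servers.<host>.openldap.*
        })
        return config

    def collect(self):
        current_conns = self._current_connections()
        self.publish('conns.current', current_conns)

    def _current_connections(self):
        return 0  # stub value for the sketch
```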