[00:14:51] (PS1) Gergő Tisza: Replace libmysqlclient-dev with default-libmysqlclient-dev [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652)
[00:16:36] (PS2) Gergő Tisza: Replace libmysqlclient-dev with default-libmysqlclient-dev [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652)
[00:21:36] (CR) BryanDavis: [C: +1] "LGTM, but needs a prod root to +2" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652) (owner: Gergő Tisza)
[00:44:09] (CR) TerraCodes: [C: +1] Disable DisableAccount on wikis where there are no disabled users [mediawiki-config] - https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) (owner: Reedy)
[00:44:58] (PS1) Divadsn: Add converted copyright svg images as png files [mediawiki-config] - https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684)
[01:01:11] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806406 (StevenJ81) This is crazy. The whole reason this version of the task asks to set up with redirects is that we were told that it would be...
[01:04:56] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806407 (Reedy) We're not saying we necessarily need to delete them. But there's certainly cleanup that would have to be done in one way or another
[01:05:25] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806408 (Reedy) Basically, it's not just as simple as "redirect them"
[01:07:51] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:16:10] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806423 (StevenJ81) Fine. What LangCom wants, and what the Board has approved, is for these projects to functionally disappear (to the public) i...
[01:32:51] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[02:06:11] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:08:02] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[02:17:18] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806460 (StevenJ81) Fine. What LangCom wants, and what the Board has approved, is for these projects to functionally disappear (to the public) i...
[02:36:42] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[02:36:51] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:57:51] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[02:58:01] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:06:01] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:06:02] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:06:46] Operations, Electron-PDFs, Design, I18n, and 3 others: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200#3806464 (Nirzar) @mobrovac it worked when you had put it in beta cluster :( > https://phabricator.wikimedia.org/T181200#3783404
[03:24:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 812.80 seconds
[03:27:11] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:27:12] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:33:32] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:36:11] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:36:21] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:38:49] (PS1) Jon Harald Søby: Enable TemplateStyles extension on svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082)
[03:55:57] Hmm, icinga won't allow me to ack that ^
[03:57:02] !log gerrit2001: icinga is flapping on the gerrit process/systemd check, but this is kind of known (not sure why it's doing this all of a sudden). It's not letting me acknowledge it, but it's fine/harmless. Cf T176532
[03:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:14] T176532: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532
[03:57:22] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:57:32] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:59:22] PROBLEM - Apache HTTP on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:21] RECOVERY - Apache HTTP on mw2132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.123 second response time
[04:03:32] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:05:31] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:05:32] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:10:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 282.62 seconds
[04:27:32] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:27:42] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[04:35:41] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:35:51] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:57:52] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:58:02] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[05:05:04] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 255015 MB (5% inode=99%)
[05:06:01] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:06:11] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
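[Editor's note: the labsdb1003 disk-space alerts above fire on a percent-free threshold. A minimal sketch of that kind of check, with assumed thresholds (warn below 10% free, critical below 6% — not the actual production values):]

```python
import shutil

# Assumed thresholds for illustration; the real icinga config may differ.
WARN_PCT = 10.0
CRIT_PCT = 6.0

def classify_free_space(total_bytes, free_bytes, warn=WARN_PCT, crit=CRIT_PCT):
    """Return an icinga-style state string for a filesystem's free space."""
    pct_free = 100.0 * free_bytes / total_bytes
    if pct_free < crit:
        return "CRITICAL"
    if pct_free < warn:
        return "WARNING"
    return "OK"

def check_path(path):
    """Classify a mounted path, e.g. check_path('/srv')."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.total, usage.free)
```

With these assumed thresholds, the 5% free reported for /srv would classify as CRITICAL, matching the alert.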
[05:11:11] !log deleting files on labsdb1003 /srv/tmp older than 30 days
[05:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:22] PROBLEM - Disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 168111 MB (3% inode=99%)
[05:28:12] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:28:22] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[05:29:31] RECOVERY - Disk space on labsdb1003 is OK: DISK OK
[05:30:15] RECOVERY - MariaDB disk space on labsdb1003 is OK: DISK OK
[05:36:22] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:36:31] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:57:31] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:57:41] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:06:32] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:06:41] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
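[Editor's note: the "deleting files older than 30 days" cleanup above is a standard age-based purge. A self-contained sketch of that operation (the SAL entry does not say how it was actually done, so this is only an illustration; dry_run defaults to True so it only selects candidates):]

```python
import os
import time

def purge_old_files(root, max_age_days=30, dry_run=True):
    """Select (and optionally remove) regular files under `root` whose
    mtime is older than `max_age_days`. Returns the selected paths."""
    cutoff = time.time() - max_age_days * 86400
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    selected.append(path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                # File vanished or is unreadable between listing and stat.
                continue
    return selected
```

Running a dry pass first and eyeballing the selected list before deleting is the usual safeguard for this kind of cleanup.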
[06:14:11] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 12870.68 ms
[06:14:21] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms
[06:16:22] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2107803
[06:27:42] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:27:52] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:27:52] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[06:28:21] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:31:02] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/nginx]
[06:36:01] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:36:45] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:53:06] andrewbogott: replication was still stopped on labsdb1003, I have started it
[06:56:01] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:01] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:57:02] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:57:52] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:06:02] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[07:06:12] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:14:59] no_justification: I have downtimed the check ^ till tomorrow as per your SAL entry
[07:17:09] marostegui: Tyvm
[07:21:11] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
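[Editor's note: the flapping gerrit alerts above come from a PROCS check that counts processes whose command line matches a regex. A minimal sketch of that mechanism, splitting the regex matching (pure, testable) from the Linux-only /proc snapshot:]

```python
import glob
import re

def count_matching(cmdlines, pattern):
    """Count command lines that match `pattern` (substring search,
    like the regex-args matching the check above reports on)."""
    rx = re.compile(pattern)
    return sum(1 for c in cmdlines if rx.search(c))

def running_cmdlines():
    """Snapshot of process command lines on a Linux host. /proc stores
    arguments NUL-separated; rejoin them with spaces for matching."""
    out = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited between listing and read
        if raw:
            out.append(raw.replace(b"\x00", b" ").decode("utf-8", "replace").strip())
    return out
```

The alert fires when the count for the gerrit daemon pattern drops to 0, and recovers when it returns to 1.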
[07:21:16] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[07:21:31] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[07:27:22] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[07:28:21] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[07:42:43] * akosiaris looking into conf2002
[07:43:21] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[07:43:26] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[07:43:42] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[07:44:05] what's up?
[07:44:33] !log ran puppet on conf2002, etcdmirror-conftool-eqiad-wmnet got started again
[07:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:57] Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
[07:45:05] raft crapped its pants for a while and
[07:45:14] Dec 03 07:18:45 conf2002 etcdmirror-conftool-eqiad-wmnet[14990]: [etcd-mirror] CRITICAL: Generic error: unsupported operand type(s) for +: 'NoneType' and 'int'
[07:45:17] there's your bug
[07:45:29] without a stacktrace unfortunately
[07:46:15] I'll create a quick task on phab about it
[07:46:25] and then be on my way
[07:46:35] k
[07:46:38] happy weekend
[07:50:17] Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#3806629 (akosiaris)
[08:02:22] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:02:32] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:01] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:11] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[08:03:12] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[08:03:12] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 11394.88 ms
[08:03:12] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:21] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms
[08:03:21] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 2.43 ms
[08:03:22] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms
[08:03:22] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:41] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 2.28 ms
[08:21:42] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:11] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:12] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:31] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:41] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Zotero alive) timed out before a response was received
[08:22:42] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:51] o.O
[08:23:02] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:21] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:32] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[08:23:42] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:10] I guess ganeti1006 is down
[08:25:29] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboot of dumps hosts - https://phabricator.wikimedia.org/T180127#3806669 (ArielGlenn) Open→Resolved dataset1001 done, closing.
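[Editor's note: the etcd-mirror error above, "unsupported operand type(s) for +: 'NoneType' and 'int'", is Python's TypeError for arithmetic on an uninitialized value, e.g. a replication index that was never set. A minimal reproduction of the failure class, with a guarded variant (the function names are illustrative, not from the etcd-mirror source):]

```python
def next_index(last_index):
    """Advance a replication index, guarding the None case explicitly.
    `last_index` may legitimately be None when nothing has been mirrored
    yet (hypothetical scenario modeled on the log line above)."""
    if last_index is None:
        raise ValueError("no index recorded yet; cannot advance")
    return last_index + 1

def unguarded_failure(last_index):
    """Show what the missing guard produces: None + 1 raises TypeError
    with exactly the message class seen in the etcd-mirror log."""
    try:
        return last_index + 1
    except TypeError as exc:
        return str(exc)
```

The log line had no stack trace because the generic error handler only printed str(exc); re-raising or logging with traceback information would have pinpointed the faulty arithmetic immediately.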
[08:25:31] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms
[08:25:41] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms
[08:25:41] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms
[08:25:41] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms
[08:25:41] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[08:25:41] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms
[08:25:42] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms
[08:25:42] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms
[08:25:43] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[08:26:12] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:26:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:28:50] (PS4) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[08:29:23] (CR) jerkins-bot: [V: -1] simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[08:30:51] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[08:31:47] (PS5) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[08:38:00] (CR) ArielGlenn: [C: +2] simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[08:43:42] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:44:47] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806675 (Strainu) @Reedy , I think that the most important is to know if the Ops team is actively working on a solution for this and if not, whe...
[09:01:47] (PS1) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:02:23] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:03:39] (PS2) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:04:15] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:04:59] the 50x were all due to piwik since bohrium went down
[09:07:24] (PS3) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:07:57] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:09:57] (PS4) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:10:44] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3806684 (elukey) ganeti1006 just froze for a couple of minutes: ``` [Sun Dec 3 08:48:11 2017] BUG: Bad page state in process qemu-system-x86 pfn:6d8600 [Sun Dec 3 08:48:11 20...
[09:19:14] (PS5) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[10:33:10] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806717 (MarcoAurelio) If the Board has approved the deletion of them, why don't we simply delete them the old way? Jesus...
[10:48:26] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398639 (MarcoAurelio)
[11:11:03] (PS1) MarcoAurelio: Delete mowiki and mowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923)
[11:13:52] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Can't find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1070-bin.001571, end_log_pos 27136838
[11:17:41] marostegui, are you around?
[11:26:02] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.47 seconds
[12:06:12] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:06:17] !log Fix dbstore1002 replication
[12:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:26] TabbyCat: hello hello ;.)
[12:13:55] :D
[12:14:16] marostegui, thanks so much for the queries. It's funny that localnames count == localusers apparently
[12:16:06] Yeah, just replied. Can't help much there as I don't know the details of the internals of those tables
[12:16:19] I guess we need someone with more MW knowledge
[12:16:25] TabbyCat: I'm going to eat! See you later!
[12:16:29] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3806885 (akosiaris) The memtester has not triggered anything since Dec 3. It's already on loop 7 and given we need the capacity to safely empty ganeti1006, I had to stop it to mig...
[12:16:34] marostegui, enjoy your meal, don't choke
[12:16:51] TabbyCat: I'll try, but I promise nothing! :-)
[12:16:55] xD
[12:17:01] or I'll be left without my favorite dba
[12:17:01] xD
[12:17:09] When you're hungry... you eat very fast and can choke
[12:17:10] (now that jynus isn't watching)
[12:17:14] haha
[12:17:18] hehe
[12:17:25] !log empty ganeti1006, it had issues this morning per T181121
[12:17:27] See you later!
[12:17:32] ciao
[12:17:33] damn hardware...
[12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:34] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121
[12:17:59] only ganeti1007 hasn't exhibited anything yet
[12:18:32] akosiaris: right, you just triggered it by saying it
[12:20:22] heh, at least that will be reassuring in a way
[12:20:47] all hardware will be easily declared problematic by then. Given they are the same bunch at least
[12:21:21] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 82.60 seconds
[12:27:25] it's a different issue this time around though... or at least the logs are different
[12:27:51] anyway, I've emptied the box, I'll have a look later on
[13:28:52] PROBLEM - NTP on sca1004 is CRITICAL: NTP CRITICAL: Offset unknown
[13:58:52] RECOVERY - NTP on sca1004 is OK: NTP OK: Offset -0.01305687428 secs
[14:24:43] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806990 (Reedy) >>! In T169450#3806675, @Strainu wrote: > @Reedy , I think that the most important is to know if the Ops team is actively workin...
[14:49:35] (PS6) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[14:58:28] (CR) ArielGlenn: [C: +2] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[15:07:20] Operations: bond eth interfaces on ms1001 - https://phabricator.wikimedia.org/T89829#3807030 (ArielGlenn) Open→declined This task is obsolete, as ms1001 is heading for decommission, and labstore hosts will be picking up this role.
[15:07:24] Operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#3807032 (ArielGlenn)
[15:26:49] (PS1) ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - https://gerrit.wikimedia.org/r/394852
[15:27:21] (CR) jerkins-bot: [V: -1] move various misc dump cron jobs to new nfs server [puppet] - https://gerrit.wikimedia.org/r/394852 (owner: ArielGlenn)
[15:33:10] (Abandoned) ArielGlenn: copy of completed dump files plus metadata from dumpsdata to web server [puppet] - https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[15:41:57] (Abandoned) ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[15:46:22] PROBLEM - HHVM rendering on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:47:12] RECOVERY - HHVM rendering on mw2143 is OK: HTTP OK: HTTP/1.1 200 OK - 74110 bytes in 0.302 second response time
[16:26:21] (CR) Jon Harald Søby: "Per @Framawiki, the deployer needs to run this script after the patch has been merged:" [mediawiki-config] - https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: Jon Harald Søby)
[16:37:41] (PS1) ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935)
[16:37:58] (CR) jerkins-bot: [V: -1] ability to do xmlpageslogging several pieces at a time in parallel [dumps] - https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) (owner: ArielGlenn)
[21:21:42] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:50:22] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:42] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:27:01] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.779 second response time
[22:30:02] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:31:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.893 second response time
[22:34:12] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:02:22] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set
[23:02:42] RECOVERY - Disk space on labstore1006 is OK: DISK OK
[23:02:42] RECOVERY - DPKG on labstore1006 is OK: All packages OK
[23:03:01] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[23:03:02] RECOVERY - configured eth on labstore1006 is OK: OK - interfaces up
[23:03:11] RECOVERY - dhclient process on labstore1006 is OK: PROCS OK: 0 processes with command name dhclient
[23:03:12] RECOVERY - Check size of conntrack table on labstore1006 is OK: OK: nf_conntrack is 0 % full
[23:04:31] RECOVERY - HP RAID on labstore1006 is OK: OK: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:1:10, 1E:1:11, 1E:1:12 - Controller: OK - Battery/Capacitor: OK
[23:08:52] RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[23:09:52] RECOVERY - IPMI Sensor Status on labstore1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[23:11:46] PROBLEM - Check size of conntrack table on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Check systemd state on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - DPKG on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Disk space on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - configured eth on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - dhclient process on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - puppet last run on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:14:16] PROBLEM - HP RAID on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:15:26] PROBLEM - IPMI Sensor Status on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:20:46] RECOVERY - Check the NTP synchronisation status of timesyncd on labstore1006 is OK: OK: synced at Sun 2017-12-03 23:20:40 UTC.
[23:25:15] PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106)
[23:28:15] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.279 second response time
[23:30:55] RECOVERY - Long running screen/tmux on labstore1006 is OK: OK: No SCREEN or tmux processes detected.
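[Editor's note: the burst of "Return code of 255 is out of bounds" alerts above reflects the Nagios/Icinga plugin convention that checks must exit 0-3 (OK, WARNING, CRITICAL, UNKNOWN); 255 usually means the plugin never ran at all, e.g. remote execution to a host that is rebooting fails (consistent with labstore1007 going DOWN shortly after). A minimal sketch of that exit-code interpretation:]

```python
# Standard Nagios/Icinga plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def interpret_exit_code(code):
    """Map a plugin exit code to a monitoring state. Codes outside 0-3
    are reported the way icinga does above, as 'out of bounds'; a remote
    execution failure commonly surfaces as 255."""
    if code in STATES:
        return STATES[code]
    return "UNKNOWN (return code of %d is out of bounds)" % code
```

This is why every service check on labstore1007 flipped to the same message at once: the checks themselves did not fail, the transport to the host did.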
[23:32:25] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:33:06] (PS1) EddieGP: varnish: Don't redirect www.$project.org on mobile [puppet] - https://gerrit.wikimedia.org/r/394902 (https://phabricator.wikimedia.org/T154026)
[23:35:25] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.773 second response time
[23:38:26] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:16] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.316 second response time
[23:54:20] Operations, Ops-Access-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (Tgr)
[23:54:58] Operations, Ops-Access-Requests, AICaptcha, WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807592 (Tgr)