[00:14:51] (PS1) Gergő Tisza: Replace libmysqlclient-dev with default-libmysqlclient-dev [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652)
[00:16:36] (PS2) Gergő Tisza: Replace libmysqlclient-dev with default-libmysqlclient-dev [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652)
[00:21:36] (CR) BryanDavis: [C: +1] "LGTM, but needs a prod root to +2" [puppet/wikimetrics] - https://gerrit.wikimedia.org/r/394818 (https://phabricator.wikimedia.org/T51652) (owner: Gergő Tisza)
[00:44:09] (CR) TerraCodes: [C: +1] Disable DisableAccount on wikis where there are no disabled users [mediawiki-config] - https://gerrit.wikimedia.org/r/338792 (https://phabricator.wikimedia.org/T106067) (owner: Reedy)
[00:44:58] (PS1) Divadsn: Add converted copyright svg images as png files [mediawiki-config] - https://gerrit.wikimedia.org/r/394820 (https://phabricator.wikimedia.org/T166684)
[01:01:11] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806406 (StevenJ81) This is crazy. The whole reason this version of the task asks to set up with redirects is that we were told that it would be...
[01:04:56] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806407 (Reedy) We're not saying we necessarily need to delete them. But there's certainly cleanup that would have to be done in one way or another
[01:05:25] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806408 (Reedy) Basically, it's not just as simple as "redirect them"
[01:07:51] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:16:10] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806423 (StevenJ81) Fine. What LangCom wants, and what the Board has approved, is for these projects to functionally disappear (to the public) i...
[01:32:51] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[02:06:11] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received
[02:08:02] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[02:17:18] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806460 (StevenJ81) Fine. What LangCom wants, and what the Board has approved, is for these projects to functionally disappear (to the public) i...
[02:36:42] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[02:36:51] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:57:51] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[02:58:01] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:06:01] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:06:02] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:06:46] Operations, Electron-PDFs, Design, I18n, and 3 others: Use "Charter" as preferred typeface on Electron - https://phabricator.wikimedia.org/T181200#3806464 (Nirzar) @mobrovac it worked when you had put it in beta cluster :( > https://phabricator.wikimedia.org/T181200#3783404
[03:24:41] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 812.80 seconds
[03:27:11] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:27:12] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:33:32] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:36:11] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:36:21] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:38:49] (PS1) Jon Harald Søby: Enable TemplateStyles extension on svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/394831 (https://phabricator.wikimedia.org/T176082)
[03:55:57] Hmm, icinga won't allow me to ack that ^
[03:57:02] !log gerrit2001: icinga is flapping on the gerrit process/systemd check, but this is kind of known (not sure why it's doing this all of a sudden). It's not letting me acknowledge it, but it's fine/harmless. Cf T176532
[03:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:57:14] T176532: Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532
[03:57:22] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[03:57:32] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[03:59:22] PROBLEM - Apache HTTP on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:00:21] RECOVERY - Apache HTTP on mw2132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.123 second response time
[04:03:32] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:05:31] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:05:32] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:10:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 282.62 seconds
[04:27:32] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:27:42] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[04:35:41] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:35:51] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:57:52] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[04:58:02] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[05:05:04] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 255015 MB (5% inode=99%)
[05:06:01] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:06:11] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
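[Editor's note: the labsdb1003 disk-space alerts above fire on a percent-free threshold. A minimal sketch of that kind of check, with assumed thresholds (warn below 10% free, critical below 6% — not the actual production values):]

```python
import shutil

# Assumed thresholds for illustration; the real icinga config may differ.
WARN_PCT = 10.0
CRIT_PCT = 6.0

def classify_free_space(total_bytes, free_bytes, warn=WARN_PCT, crit=CRIT_PCT):
    """Return an icinga-style state string for a filesystem's free space."""
    pct_free = 100.0 * free_bytes / total_bytes
    if pct_free < crit:
        return "CRITICAL"
    if pct_free < warn:
        return "WARNING"
    return "OK"

def check_path(path):
    """Classify a mounted path, e.g. check_path('/srv')."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.total, usage.free)
```

With these assumed thresholds, the 5% free reported for /srv would classify as CRITICAL, matching the alert.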
[05:11:11] !log deleting files on labsdb1003 /srv/tmp older than 30 days
[05:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:22] PROBLEM - Disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 168111 MB (3% inode=99%)
[05:28:12] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:28:22] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[05:29:31] RECOVERY - Disk space on labsdb1003 is OK: DISK OK
[05:30:15] RECOVERY - MariaDB disk space on labsdb1003 is OK: DISK OK
[05:36:22] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:36:31] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:57:31] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[05:57:41] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:06:32] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:06:41] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
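[Editor's note: the "deleting files older than 30 days" cleanup above is a standard age-based purge. A self-contained sketch of that operation (the SAL entry does not say how it was actually done, so this is only an illustration; dry_run defaults to True so it only selects candidates):]

```python
import os
import time

def purge_old_files(root, max_age_days=30, dry_run=True):
    """Select (and optionally remove) regular files under `root` whose
    mtime is older than `max_age_days`. Returns the selected paths."""
    cutoff = time.time() - max_age_days * 86400
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    selected.append(path)
                    if not dry_run:
                        os.remove(path)
            except OSError:
                # File vanished or is unreadable between listing and stat.
                continue
    return selected
```

Running a dry pass first and eyeballing the selected list before deleting is the usual safeguard for this kind of cleanup.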
[06:14:11] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 12870.68 ms
[06:14:21] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.84 ms
[06:16:22] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2107803
[06:27:42] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:27:52] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:27:52] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]
[06:28:21] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:31:02] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/nginx]
[06:36:01] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:36:45] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:53:06] andrewbogott: replication was still stopped on labsdb1003, I have started it
[06:56:01] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:01] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[06:57:02] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[06:57:52] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:06:02] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[07:06:12] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:14:59] no_justification: I have downtimed the check ^ till tomorrow as per your SAL entry
[07:17:09] marostegui: Tyvm
[07:21:11] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
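[Editor's note: the flapping gerrit alerts above come from a PROCS check that counts processes whose command line matches a regex. A minimal sketch of that mechanism, splitting the regex matching (pure, testable) from the Linux-only /proc snapshot:]

```python
import glob
import re

def count_matching(cmdlines, pattern):
    """Count command lines that match `pattern` (substring search,
    like the regex-args matching the check above reports on)."""
    rx = re.compile(pattern)
    return sum(1 for c in cmdlines if rx.search(c))

def running_cmdlines():
    """Snapshot of process command lines on a Linux host. /proc stores
    arguments NUL-separated; rejoin them with spaces for matching."""
    out = []
    for path in glob.glob("/proc/[0-9]*/cmdline"):
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited between listing and read
        if raw:
            out.append(raw.replace(b"\x00", b" ").decode("utf-8", "replace").strip())
    return out
```

The alert fires when the count for the gerrit daemon pattern drops to 0, and recovers when it returns to 1.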
[07:21:16] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused
[07:21:31] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed
[07:27:22] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational
[07:28:21] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[07:42:43] * akosiaris looking into conf2002
[07:43:21] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational
[07:43:26] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time
[07:43:42] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active
[07:44:05] what's up?
[07:44:33] !log ran puppet on conf2002, etcdmirror-conftool-eqiad-wmnet got started again
[07:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:57] Raft Internal Error : etcdserver: request timed out, possibly due to previous leader failure
[07:45:05] raft crapped its pants for a while and
[07:45:14] Dec 03 07:18:45 conf2002 etcdmirror-conftool-eqiad-wmnet[14990]: [etcd-mirror] CRITICAL: Generic error: unsupported operand type(s) for +: 'NoneType' and 'int'
[07:45:17] there's your bug
[07:45:29] without a stacktrace unfortunately
[07:46:15] I'll create a quick task on phab about it
[07:46:25] and then be on my way
[07:46:35] k
[07:46:38] happy weekend
[07:50:17] Operations: etcd-mirror failure - https://phabricator.wikimedia.org/T181920#3806629 (akosiaris)
[08:02:22] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:02:32] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:01] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:11] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[08:03:12] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[08:03:12] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 64%, RTA = 11394.88 ms
[08:03:12] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:21] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 2.52 ms
[08:03:21] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 2.43 ms
[08:03:22] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms
[08:03:22] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:41] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 2.28 ms
[08:21:42] PROBLEM - Host actinium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:11] PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:12] PROBLEM - Host fermium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:31] PROBLEM - Host dysprosium is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host etcd1005 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:32] PROBLEM - Host neon is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:41] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Zotero alive) timed out before a response was received
[08:22:42] PROBLEM - Host etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:22:51] o.O
[08:23:02] PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:21] PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:32] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[08:23:42] PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:10] I guess ganeti1006 is down
[08:25:29] Operations, Datasets-General-or-Unknown, User-ArielGlenn: Reboot of dumps hosts - https://phabricator.wikimedia.org/T180127#3806669 (ArielGlenn) Open→Resolved dataset1001 done, closing.
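[Editor's note: the etcd-mirror error above, "unsupported operand type(s) for +: 'NoneType' and 'int'", is Python's TypeError for arithmetic on an uninitialized value, e.g. a replication index that was never set. A minimal reproduction of the failure class, with a guarded variant (the function names are illustrative, not from the etcd-mirror source):]

```python
def next_index(last_index):
    """Advance a replication index, guarding the None case explicitly.
    `last_index` may legitimately be None when nothing has been mirrored
    yet (hypothetical scenario modeled on the log line above)."""
    if last_index is None:
        raise ValueError("no index recorded yet; cannot advance")
    return last_index + 1

def unguarded_failure(last_index):
    """Show what the missing guard produces: None + 1 raises TypeError
    with exactly the message class seen in the etcd-mirror log."""
    try:
        return last_index + 1
    except TypeError as exc:
        return str(exc)
```

The log line had no stack trace because the generic error handler only printed str(exc); re-raising or logging with traceback information would have pinpointed the faulty arithmetic immediately.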
[08:25:31] RECOVERY - Host actinium is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms
[08:25:41] RECOVERY - Host etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 1.98 ms
[08:25:41] RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 2.08 ms
[08:25:41] RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms
[08:25:41] RECOVERY - Host neon is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[08:25:41] RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 2.14 ms
[08:25:42] RECOVERY - Host fermium is UP: PING OK - Packet loss = 0%, RTA = 1.94 ms
[08:25:42] RECOVERY - Host dysprosium is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms
[08:25:43] RECOVERY - Host etcd1005 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms
[08:26:12] RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[08:26:41] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:28:50] (PS4) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[08:29:23] (CR) jerkins-bot: [V: -1] simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[08:30:51] RECOVERY - Host bohrium is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[08:31:47] (PS5) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[08:38:00] (CR) ArielGlenn: [C: +2] simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[08:43:42] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:44:47] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806675 (Strainu) @Reedy , I think that the most important is to know if the Ops team is actively working on a solution for this and if not, whe...
[09:01:47] (PS1) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:02:23] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:03:39] (PS2) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:04:15] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:04:59] the 50x were all due to piwik since bohrium went down
[09:07:24] (PS3) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:07:57] (CR) jerkins-bot: [V: -1] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[09:09:57] (PS4) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[09:10:44] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3806684 (elukey) ganeti1006 just froze for a couple of minutes: ``` [Sun Dec 3 08:48:11 2017] BUG: Bad page state in process qemu-system-x86 pfn:6d8600 [Sun Dec 3 08:48:11 20...
[09:19:14] (PS5) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[10:33:10] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806717 (MarcoAurelio) If the Board has approved the deletion of them, why don't we simply delete them the old way? Jesus...
[10:48:26] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3398639 (MarcoAurelio)
[11:11:03] (PS1) MarcoAurelio: Delete mowiki and mowiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/394846 (https://phabricator.wikimedia.org/T181923)
[11:13:52] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table wikidatawiki.page_props: Can't find record in page_props, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the event's master log db1070-bin.001571, end_log_pos 27136838
[11:17:41] marostegui, are you around?
[11:26:02] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.47 seconds
[12:06:12] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:06:17] !log Fix dbstore1002 replication
[12:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:26] TabbyCat: hello hello ;.)
[12:13:55] :D
[12:14:16] marostegui, thanks so much for the queries. It's funny that localnames count == localusers apparently
[12:16:06] Yeah, just replied. Can't help much there as I don't know the details of the internals of those tables
[12:16:19] I guess we need someone with more MW knowledge
[12:16:25] TabbyCat: I'm going to eat! See you later!
[12:16:29] Operations, ops-eqiad: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121#3806885 (akosiaris) The memtester has not triggered anything since Dec 3. It's already on loop 7 and given we need the capacity to safely empty ganeti1006, I had to stop it to mig...
[12:16:34] marostegui, enjoy your meal, don't choke
[12:16:51] TabbyCat: I'll try, but I promise nothing! :-)
[12:16:55] xD
[12:17:01] or I'll be left without my favorite dba
[12:17:01] xD
[12:17:09] When you're hungry... you eat very fast and can choke
[12:17:10] (now that jynus isn't watching)
[12:17:14] haha
[12:17:18] hehe
[12:17:25] !log empty ganeti1006, it had issues this morning per T181121
[12:17:27] See you later!
[12:17:32] ciao
[12:17:33] damn hardware...
[12:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:34] T181121: Possible memory errors on ganeti1005, ganeti1006, ganeti1008 - https://phabricator.wikimedia.org/T181121
[12:17:59] only ganeti1007 hasn't exhibited anything yet
[12:18:32] akosiaris: right, you just triggered it by saying it
[12:20:22] heh, at least that will be reassuring in a way
[12:20:47] all hardware will be easily declared problematic by then. Given they are the same bunch at least
[12:21:21] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 82.60 seconds
[12:27:25] it's a different issue this time around though... or at least the logs are different
[12:27:51] anyway, I've emptied the box, I'll have a look later on
[13:28:52] PROBLEM - NTP on sca1004 is CRITICAL: NTP CRITICAL: Offset unknown
[13:58:52] RECOVERY - NTP on sca1004 is OK: NTP OK: Offset -0.01305687428 secs
[14:24:43] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3806990 (Reedy) >>! In T169450#3806675, @Strainu wrote: > @Reedy , I think that the most important is to know if the Ops team is actively workin...
[14:49:35] (PS6) ArielGlenn: cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895)
[14:58:28] (CR) ArielGlenn: [C: +2] cleanup xml/sql dumps on all hosts that have them [puppet] - https://gerrit.wikimedia.org/r/394842 (https://phabricator.wikimedia.org/T181895) (owner: ArielGlenn)
[15:07:20] Operations: bond eth interfaces on ms1001 - https://phabricator.wikimedia.org/T89829#3807030 (ArielGlenn) Open→declined This task is obsolete, as ms1001 is heading for decommission, and labstore hosts will be picking up this role.
[15:07:24] Operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917#3807032 (ArielGlenn)
[15:26:49] (PS1) ArielGlenn: move various misc dump cron jobs to new nfs server [puppet] - https://gerrit.wikimedia.org/r/394852
[15:27:21] (CR) jerkins-bot: [V: -1] move various misc dump cron jobs to new nfs server [puppet] - https://gerrit.wikimedia.org/r/394852 (owner: ArielGlenn)
[15:33:10] (Abandoned) ArielGlenn: copy of completed dump files plus metadata from dumpsdata to web server [puppet] - https://gerrit.wikimedia.org/r/374606 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[15:41:57] (Abandoned) ArielGlenn: setup for dumpsdata hosts to serve dumps work area via nfs to snapshots [puppet] - https://gerrit.wikimedia.org/r/366308 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[15:46:22] PROBLEM - HHVM rendering on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:47:12] RECOVERY - HHVM rendering on mw2143 is OK: HTTP OK: HTTP/1.1 200 OK - 74110 bytes in 0.302 second response time
[16:26:21] (CR) Jon Harald Søby: "Per @Framawiki, the deployer needs to run this script after the patch has been merged:" [mediawiki-config] - https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: Jon Harald Søby)
[16:37:41] (PS1) ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935)
[16:37:58] (CR) jerkins-bot: [V: -1] ability to do xmlpageslogging several pieces at a time in parallel [dumps] - https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) (owner: ArielGlenn)
[21:21:42] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:50:22] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:51:42] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[22:27:01] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.779 second response time
[22:30:02] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:31:11] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 278 bytes in 9.893 second response time
[22:34:12] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:02:22] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set
[23:02:42] RECOVERY - Disk space on labstore1006 is OK: DISK OK
[23:02:42] RECOVERY - DPKG on labstore1006 is OK: All packages OK
[23:03:01] RECOVERY - Check systemd state on labstore1006 is OK: OK - running: The system is fully operational
[23:03:02] RECOVERY - configured eth on labstore1006 is OK: OK - interfaces up
[23:03:11] RECOVERY - dhclient process on labstore1006 is OK: PROCS OK: 0 processes with command name dhclient
[23:03:12] RECOVERY - Check size of conntrack table on labstore1006 is OK: OK: nf_conntrack is 0 % full
[23:04:31] RECOVERY - HP RAID on labstore1006 is OK: OK: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:1:10, 1E:1:11, 1E:1:12 - Controller: OK - Battery/Capacitor: OK
[23:08:52] RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[23:09:52] RECOVERY - IPMI Sensor Status on labstore1006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[23:11:46] PROBLEM - Check size of conntrack table on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Check systemd state on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - DPKG on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - Disk space on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - configured eth on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - dhclient process on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:11:47] PROBLEM - puppet last run on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:14:16] PROBLEM - HP RAID on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:15:26] PROBLEM - IPMI Sensor Status on labstore1007 is CRITICAL: Return code of 255 is out of bounds
[23:20:46] RECOVERY - Check the NTP synchronisation status of timesyncd on labstore1006 is OK: OK: synced at Sun 2017-12-03 23:20:40 UTC.
[23:25:15] PROBLEM - Host labstore1007 is DOWN: CRITICAL - Host Unreachable (208.80.155.106)
[23:28:15] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 4.279 second response time
[23:30:55] RECOVERY - Long running screen/tmux on labstore1006 is OK: OK: No SCREEN or tmux processes detected.
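[Editor's note: the burst of "Return code of 255 is out of bounds" alerts above reflects the Nagios/Icinga plugin convention that checks must exit 0-3 (OK, WARNING, CRITICAL, UNKNOWN); 255 usually means the plugin never ran at all, e.g. remote execution to a host that is rebooting fails (consistent with labstore1007 going DOWN shortly after). A minimal sketch of that exit-code interpretation:]

```python
# Standard Nagios/Icinga plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def interpret_exit_code(code):
    """Map a plugin exit code to a monitoring state. Codes outside 0-3
    are reported the way icinga does above, as 'out of bounds'; a remote
    execution failure commonly surfaces as 255."""
    if code in STATES:
        return STATES[code]
    return "UNKNOWN (return code of %d is out of bounds)" % code
```

This is why every service check on labstore1007 flipped to the same message at once: the checks themselves did not fail, the transport to the host did.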
[23:32:25] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:33:06] (PS1) EddieGP: varnish: Don't redirect www.$project.org on mobile [puppet] - https://gerrit.wikimedia.org/r/394902 (https://phabricator.wikimedia.org/T154026)
[23:35:25] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 6.773 second response time
[23:38:26] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:16] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.316 second response time
[23:54:20] Operations, Ops-Access-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (Tgr)
[23:54:58] Operations, Ops-Access-Requests, AICaptcha, WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807592 (Tgr)